-
FAIR-SIGHT: Fairness Assurance in Image Recognition via Simultaneous Conformal Thresholding and Dynamic Output Repair
Authors:
Arya Fayyazi,
Mehdi Kamal,
Massoud Pedram
Abstract:
We introduce FAIR-SIGHT, an innovative post-hoc framework designed to ensure fairness in computer vision systems by combining conformal prediction with a dynamic output repair mechanism. Our approach calculates a fairness-aware non-conformity score that simultaneously assesses prediction errors and fairness violations. Using conformal prediction, we establish an adaptive threshold that provides ri…
▽ More
We introduce FAIR-SIGHT, an innovative post-hoc framework designed to ensure fairness in computer vision systems by combining conformal prediction with a dynamic output repair mechanism. Our approach calculates a fairness-aware non-conformity score that simultaneously assesses prediction errors and fairness violations. Using conformal prediction, we establish an adaptive threshold that provides rigorous finite-sample, distribution-free guarantees. When the non-conformity score for a new image exceeds the calibrated threshold, FAIR-SIGHT implements targeted corrective adjustments, such as logit shifts for classification and confidence recalibration for detection, to reduce both group and individual fairness disparities, all without the need for retraining or having access to internal model parameters. Comprehensive theoretical analysis validates our method's error control and convergence properties. At the same time, extensive empirical evaluations on benchmark datasets show that FAIR-SIGHT significantly reduces fairness disparities while preserving high predictive performance.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
FACTER: Fairness-Aware Conformal Thresholding and Prompt Engineering for Enabling Fair LLM-Based Recommender Systems
Authors:
Arya Fayyazi,
Mehdi Kamal,
Massoud Pedram
Abstract:
We propose FACTER, a fairness-aware framework for LLM-based recommendation systems that integrates conformal prediction with dynamic prompt engineering. By introducing an adaptive semantic variance threshold and a violation-triggered mechanism, FACTER automatically tightens fairness constraints whenever biased patterns emerge. We further develop an adversarial prompt generator that leverages histo…
▽ More
We propose FACTER, a fairness-aware framework for LLM-based recommendation systems that integrates conformal prediction with dynamic prompt engineering. By introducing an adaptive semantic variance threshold and a violation-triggered mechanism, FACTER automatically tightens fairness constraints whenever biased patterns emerge. We further develop an adversarial prompt generator that leverages historical violations to reduce repeated demographic biases without retraining the LLM. Empirical results on MovieLens and Amazon show that FACTER substantially reduces fairness violations (up to 95.5%) while maintaining strong recommendation accuracy, revealing semantic variance as a potent proxy of bias.
△ Less
Submitted 5 February, 2025;
originally announced February 2025.
-
CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference
Authors:
Mohammad Erfan Sadeghi,
Arash Fayyazi,
Suhas Somashekar,
Massoud Pedram
Abstract:
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. Unlike traditional approaches, ViTs employ the self-attention mechanism, which has been widely used in natural language processing, to analyze image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FP…
▽ More
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. Unlike traditional approaches, ViTs employ the self-attention mechanism, which has been widely used in natural language processing, to analyze image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FPGAs), introduces considerable challenges. These challenges stem primarily from the non-linear calculations and high computational and memory demands of ViTs. This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on the FPGAs in order to maximize performance. Our framework is built upon three fundamental contributions: multi-kernel design to maximize the bandwidth, mainly targeting benefits of multi DDR memory banks, approximate non-linear functions that exhibit minimal accuracy degradation, and efficient use of available logic blocks on the FPGA, and efficient compiler to maximize the performance and memory-efficiency of the computing kernels by presenting a novel algorithm for design space exploration to find optimal hardware configuration that achieves optimal throughput and latency. Compared to the state-of-the-art ViT accelerators, CHOSEN achieves a 1.5x and 1.42x improvement in the throughput on the DeiT-S and DeiT-B models.
△ Less
Submitted 24 July, 2024; v1 submitted 17 July, 2024;
originally announced July 2024.
-
Dynamic Co-Optimization Compiler: Leveraging Multi-Agent Reinforcement Learning for Enhanced DNN Accelerator Performance
Authors:
Arya Fayyazi,
Mehdi Kamal,
Massoud Pedram
Abstract:
This paper introduces a novel Dynamic Co-Optimization Compiler (DCOC), which employs an adaptive Multi-Agent Reinforcement Learning (MARL) framework to enhance the efficiency of mapping machine learning (ML) models, particularly Deep Neural Networks (DNNs), onto diverse hardware platforms. DCOC incorporates three specialized actor-critic agents within MARL, each dedicated to different optimization…
▽ More
This paper introduces a novel Dynamic Co-Optimization Compiler (DCOC), which employs an adaptive Multi-Agent Reinforcement Learning (MARL) framework to enhance the efficiency of mapping machine learning (ML) models, particularly Deep Neural Networks (DNNs), onto diverse hardware platforms. DCOC incorporates three specialized actor-critic agents within MARL, each dedicated to different optimization facets: one for hardware and two for software. This cooperative strategy results in an integrated hardware/software co-optimization approach, improving the precision and speed of DNN deployments. By focusing on high-confidence configurations, DCOC effectively reduces the search space, achieving remarkable performance over existing methods. Our results demonstrate that DCOC enhances throughput by up to 37.95% while reducing optimization time by up to 42.2% across various DNN models, outperforming current state-of-the-art frameworks.
△ Less
Submitted 21 February, 2025; v1 submitted 11 July, 2024;
originally announced July 2024.
-
PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers
Authors:
Mohammad Erfan Sadeghi,
Arash Fayyazi,
Seyedarmin Azizi,
Massoud Pedram
Abstract:
The deployment of Vision Transformers (ViTs) on hardware platforms, specially Field-Programmable Gate Arrays (FPGAs), presents many challenges, which are mainly due to the substantial computational and power requirements of their non-linear functions, notably layer normalization, softmax, and Gaussian Error Linear Unit (GELU). These critical functions pose significant obstacles to efficient hardwa…
▽ More
The deployment of Vision Transformers (ViTs) on hardware platforms, specially Field-Programmable Gate Arrays (FPGAs), presents many challenges, which are mainly due to the substantial computational and power requirements of their non-linear functions, notably layer normalization, softmax, and Gaussian Error Linear Unit (GELU). These critical functions pose significant obstacles to efficient hardware implementation due to their complex mathematical operations and the inherent resource count and architectural limitations of FPGAs. PEANO-ViT offers a novel approach to streamlining the implementation of the layer normalization layer by introducing a division-free technique that simultaneously approximates the division and square root function. Additionally, PEANO-ViT provides a multi-scale division strategy to eliminate division operations in the softmax layer, aided by a Pade-based approximation for the exponential function. Finally, PEANO-ViT introduces a piece-wise linear approximation for the GELU function, carefully designed to bypass the computationally intensive operations associated with GELU. In our comprehensive evaluations, PEANO-ViT exhibits minimal accuracy degradation (<= 0.5% for DeiT-B) while significantly enhancing power efficiency, achieving improvements of 1.91x, 1.39x, 8.01x for layer normalization, softmax, and GELU, respectively. This improvement is achieved through substantial reductions in DSP, LUT, and register counts for these non-linear operations. Consequently, PEANO-ViT enables efficient deployment of Vision Transformers on resource- and power-constrained FPGA platforms.
△ Less
Submitted 16 August, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Scalable Superconductor Neuron with Ternary Synaptic Connections for Ultra-Fast SNN Hardware
Authors:
Mustafa Altay Karamuftuoglu,
Beyza Zeynep Ucpinar,
Arash Fayyazi,
Sasan Razmkhah,
Mehdi Kamal,
Massoud Pedram
Abstract:
A novel high-fan-in differential superconductor neuron structure designed for ultra-high-performance Spiking Neural Network (SNN) accelerators is presented. Utilizing a high-fan-in neuron structure allows us to design SNN accelerators with more synaptic connections, enhancing the overall network capabilities. The proposed neuron design is based on superconductor electronics fabric, incorporating m…
▽ More
A novel high-fan-in differential superconductor neuron structure designed for ultra-high-performance Spiking Neural Network (SNN) accelerators is presented. Utilizing a high-fan-in neuron structure allows us to design SNN accelerators with more synaptic connections, enhancing the overall network capabilities. The proposed neuron design is based on superconductor electronics fabric, incorporating multiple superconducting loops, each with two Josephson Junctions. This arrangement enables each input data branch to have positive and negative inductive coupling, supporting excitatory and inhibitory synaptic data. Compatibility with synaptic devices and thresholding operation is achieved using a single flux quantum (SFQ) pulse-based logic style. The neuron design, along with ternary synaptic connections, forms the foundation for a superconductor-based SNN inference. To demonstrate the capabilities of our design, we train the SNN using snnTorch, augmenting the PyTorch framework. After pruning, the demonstrated SNN inference achieves an impressive 96.1% accuracy on MNIST images. Notably, the network exhibits a remarkable throughput of 8.92 GHz while consuming only 1.5 nJ per inference, including the energy consumption associated with cooling to 4K. These results underscore the potential of superconductor electronics in developing high-performance and ultra-energy-efficient neural network accelerator architectures.
△ Less
Submitted 27 February, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Sensitivity-Aware Mixed-Precision Quantization and Width Optimization of Deep Neural Networks Through Cluster-Based Tree-Structured Parzen Estimation
Authors:
Seyedarmin Azizi,
Mahdi Nazemi,
Arash Fayyazi,
Massoud Pedram
Abstract:
As the complexity and computational demands of deep learning models rise, the need for effective optimization methods for neural network designs becomes paramount. This work introduces an innovative search mechanism for automatically selecting the best bit-width and layer-width for individual neural network layers. This leads to a marked enhancement in deep neural network efficiency. The search do…
▽ More
As the complexity and computational demands of deep learning models rise, the need for effective optimization methods for neural network designs becomes paramount. This work introduces an innovative search mechanism for automatically selecting the best bit-width and layer-width for individual neural network layers. This leads to a marked enhancement in deep neural network efficiency. The search domain is strategically reduced by leveraging Hessian-based pruning, ensuring the removal of non-crucial parameters. Subsequently, we detail the development of surrogate models for favorable and unfavorable outcomes by employing a cluster-based tree-structured Parzen estimator. This strategy allows for a streamlined exploration of architectural possibilities and swift pinpointing of top-performing designs. Through rigorous testing on well-known datasets, our method proves its distinct advantage over existing methods. Compared to leading compression strategies, our approach records an impressive 20% decrease in model size without compromising accuracy. Additionally, our method boasts a 12x reduction in search time relative to the best search-focused strategies currently available. As a result, our proposed method represents a leap forward in neural network design optimization, paving the way for quick model design and implementation in settings with limited resources, thereby propelling the potential of scalable deep learning solutions.
△ Less
Submitted 9 August, 2024; v1 submitted 11 August, 2023;
originally announced August 2023.
-
NeuroBlend: Towards Low-Power yet Accurate Neural Network-Based Inference Engine Blending Binary and Fixed-Point Convolutions
Authors:
Arash Fayyazi,
Mahdi Nazemi,
Arya Fayyazi,
Massoud Pedram
Abstract:
This paper introduces NeuroBlend, a novel neural network architecture featuring a unique building block known as the Blend module. This module incorporates binary and fixed-point convolutions in its main and skip paths, respectively. There is a judicious deployment of batch normalizations on both main and skip paths inside the Blend module and in between consecutive Blend modules. Additionally, we…
▽ More
This paper introduces NeuroBlend, a novel neural network architecture featuring a unique building block known as the Blend module. This module incorporates binary and fixed-point convolutions in its main and skip paths, respectively. There is a judicious deployment of batch normalizations on both main and skip paths inside the Blend module and in between consecutive Blend modules. Additionally, we present a compiler and hardware architecture designed to map NeuroBlend models onto FPGA devices, aiming to minimize inference latency while maintaining high accuracy. Our NeuroBlend-20 (NeuroBlend-18) model, derived from ResNet-20 (ResNet-18) trained on CIFAR-10 (CIFAR-100), achieves 88.0\% (73.73\%) classification accuracy, outperforming state-of-the-art binary neural networks by 0.8\% (1.33\%), with an inference time of 0.38ms per image, 1.4x faster than previous FPGA implementation for BNNs. Similarly, our BlendMixer model for CIFAR-10 attains 90.6\% accuracy(1.59\% less than full precision MLPMixer), with a 3.5x reduction in model size compared to full precision MLPMixer. Furthermore, leveraging DSP blocks for 48-bit bitwise logic operations enables low-power FPGA implementation, yielding a 2.5x reduction in power consumption.
△ Less
Submitted 1 May, 2024; v1 submitted 7 July, 2023;
originally announced July 2023.
-
CrAFT: Compression-Aware Fine-Tuning for Efficient Visual Task Adaptation
Authors:
Jung Hwan Heo,
Seyedarmin Azizi,
Arash Fayyazi,
Massoud Pedram
Abstract:
Transfer learning has become a popular task adaptation method in the era of foundation models. However, many foundation models require large storage and computing resources, which makes off-the-shelf deployment impractical. Post-training compression techniques such as pruning and quantization can help lower deployment costs. Unfortunately, the resulting performance degradation limits the usability…
▽ More
Transfer learning has become a popular task adaptation method in the era of foundation models. However, many foundation models require large storage and computing resources, which makes off-the-shelf deployment impractical. Post-training compression techniques such as pruning and quantization can help lower deployment costs. Unfortunately, the resulting performance degradation limits the usability and benefits of such techniques. To close this performance gap, we propose CrAFT, a simple fine-tuning framework that enables effective post-training network compression. In CrAFT, users simply employ the default fine-tuning schedule along with sharpness minimization objective, simultaneously facilitating task adaptation and compression-friendliness. Contrary to the conventional sharpness minimization techniques, which are applied during pretraining, the CrAFT approach adds negligible training overhead as fine-tuning is done in under a couple of minutes or hours with a single GPU. The effectiveness of CrAFT, which is a general-purpose tool that can significantly boost one-shot pruning and post-training quantization, is demonstrated on both convolution-based and attention-based vision foundation models on a variety of target tasks. The code will be made publicly available.
△ Less
Submitted 8 July, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
-
Algorithms and Hardware for Efficient Processing of Logic-based Neural Networks
Authors:
Jingkai Hong,
Arash Fayyazi,
Amirhossein Esmaili,
Mahdi Nazemi,
Massoud Pedram
Abstract:
Recent efforts to improve the performance of neural network (NN) accelerators that meet today's application requirements have given rise to a new trend of logic-based NN inference relying on fixed-function combinational logic (FFCL). This paper presents an innovative optimization methodology for compiling and mapping NNs utilizing FFCL into a logic processor. The presented method maps FFCL blocks…
▽ More
Recent efforts to improve the performance of neural network (NN) accelerators that meet today's application requirements have given rise to a new trend of logic-based NN inference relying on fixed-function combinational logic (FFCL). This paper presents an innovative optimization methodology for compiling and mapping NNs utilizing FFCL into a logic processor. The presented method maps FFCL blocks to a set of Boolean functions where Boolean operations in each function are mapped to high-performance, low-latency, parallelized processing elements. Graph partitioning and scheduling algorithms are presented to handle FFCL blocks that cannot straightforwardly fit the logic processor. Our experimental evaluations across several datasets and NNs demonstrate the superior performance of our framework in terms of the inference throughput compared to prior art NN accelerators. We achieve 25x higher throughput compared with the XNOR-based accelerator for VGG16 model that can be amplified 5x deploying the graph partitioning and merging algorithms.
△ Less
Submitted 13 April, 2023;
originally announced April 2023.
-
Training-Free Acceleration of ViTs with Delayed Spatial Merging
Authors:
Jung Hwan Heo,
Seyedarmin Azizi,
Arash Fayyazi,
Massoud Pedram
Abstract:
Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize…
▽ More
Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction and 1.6$\times$ throughput speedup at a negligible loss while being two orders of magnitude faster than existing methods.
△ Less
Submitted 1 July, 2024; v1 submitted 4 March, 2023;
originally announced March 2023.
-
Better Than Worst-Case Decoding for Quantum Error Correction
Authors:
Gokul Subramanian Ravi,
Jonathan M. Baker,
Arash Fayyazi,
Sophia Fuhui Lin,
Ali Javadi-Abhari,
Massoud Pedram,
Frederic T. Chong
Abstract:
The overheads of classical decoding for quantum error correction on superconducting quantum systems grow rapidly with the number of logical qubits and their correction code distance. Decoding at room temperature is bottle-necked by refrigerator I/O bandwidth while cryogenic on-chip decoding is limited by area/power/thermal budget.
To overcome these overheads, we are motivated by the observation…
▽ More
The overheads of classical decoding for quantum error correction on superconducting quantum systems grow rapidly with the number of logical qubits and their correction code distance. Decoding at room temperature is bottle-necked by refrigerator I/O bandwidth while cryogenic on-chip decoding is limited by area/power/thermal budget.
To overcome these overheads, we are motivated by the observation that in the common case, error signatures are fairly trivial with high redundancy/sparsity, since the error correction codes are over-provisioned to correct for uncommon worst-case complex scenarios (to ensure substantially low logical error rates). If suitably exploited, these trivial signatures can be decoded and corrected with insignificant overhead, thereby alleviating the bottlenecks described above, while still handling the worst-case complex signatures by state-of-the-art means.
Our proposal, targeting Surface Codes, consists of:
1) Clique: A lightweight decoder for decoding and correcting trivial common-case errors, designed for the cryogenic domain. The decoder is implemented for SFQ logic.
2) A statistical confidence-based technique for off-chip decoding bandwidth allocation, to efficiently handle rare complex decodes which are not covered by the on-chip decoder.
3) A method for stalling circuit execution, for the worst-case scenarios in which the provisioned off-chip bandwidth is insufficient to complete all requested off-chip decodes.
In all, our proposal enables 70-99+% off-chip bandwidth elimination across a range of logical and physical error rates, without significantly sacrificing the accuracy of state-of-the-art off-chip decoding. By doing so, it achieves 10-10000x bandwidth reduction over prior off-chip bandwidth reduction techniques. Furthermore, it achieves a 15-37x resource overhead reduction compared to prior on-chip-only decoding.
△ Less
Submitted 25 October, 2022; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Efficient Compilation and Mapping of Fixed Function Combinational Logic onto Digital Signal Processors Targeting Neural Network Inference and Utilizing High-level Synthesis
Authors:
Soheil Nazar Shahsavani,
Arash Fayyazi,
Mahdi Nazemi,
Massoud Pedram
Abstract:
Recent efforts for improving the performance of neural network (NN) accelerators that meet today's application requirements have given rise to a new trend of logic-based NN inference relying on fixed function combinational logic. Mapping such large Boolean functions with many input variables and product terms to digital signal processors (DSPs) on Field-programmable gate arrays (FPGAs) needs a nov…
▽ More
Recent efforts for improving the performance of neural network (NN) accelerators that meet today's application requirements have given rise to a new trend of logic-based NN inference relying on fixed function combinational logic. Mapping such large Boolean functions with many input variables and product terms to digital signal processors (DSPs) on Field-programmable gate arrays (FPGAs) needs a novel framework considering the structure and the reconfigurability of DSP blocks during this process. The proposed methodology in this paper maps the fixed function combinational logic blocks to a set of Boolean functions where Boolean operations corresponding to each function are mapped to DSP devices rather than look-up tables (LUTs) on the FPGAs to take advantage of the high performance, low latency, and parallelism of DSP blocks. %
This paper also presents an innovative design and optimization methodology for compilation and mapping of NNs, utilizing fixed function combinational logic to DSPs on FPGAs employing high-level synthesis flow. %
Our experimental evaluations across several \REVone{datasets} and selected NNs demonstrate the comparable performance of our framework in terms of the inference latency and output accuracy compared to prior art FPGA-based NN accelerators employing DSPs.
△ Less
Submitted 30 July, 2022;
originally announced August 2022.
-
Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network Accelerators
Authors:
Jung Hwan Heo,
Arash Fayyazi,
Amirhossein Esmaili,
Massoud Pedram
Abstract:
This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorde…
▽ More
This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and activations. Through the compiler-hardware codesign, SPS dataflow enjoys higher degrees of parallelism while being free of the high indexing overhead and without model accuracy loss. Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and accompanying neural network compiler outperform prior work in convolutional neural network (CNN) accelerator designs targeting FPGA devices. Against other sparsity-supporting weight storage formats, SPS results in 4.49x energy efficiency gain while lowering storage requirements by 3.67x for total weight storage (non-pruned weights plus indexing) and 22,044x for indexing memory.
△ Less
Submitted 30 June, 2022;
originally announced July 2022.
-
NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic
Authors:
Mahdi Nazemi,
Arash Fayyazi,
Amirhossein Esmaili,
Atharva Khare,
Soheil Nazar Shahsavani,
Massoud Pedram
Abstract:
While there is a large body of research on efficient processing of deep neural networks (DNNs), ultra-low-latency realization of these models for applications with stringent, sub-microsecond latency requirements continues to be an unresolved, challenging problem. Field-programmable gate array (FPGA)-based DNN accelerators are gaining traction as a serious contender to replace graphics processing u…
▽ More
While there is a large body of research on efficient processing of deep neural networks (DNNs), ultra-low-latency realization of these models for applications with stringent, sub-microsecond latency requirements continues to be an unresolved, challenging problem. Field-programmable gate array (FPGA)-based DNN accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms considering their performance, flexibility, and energy efficiency. This paper presents NullaNet Tiny, an across-the-stack design and optimization framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators. The key idea is to replace expensive operations required to compute various filter/neuron functions in a DNN with Boolean logic expressions that are mapped to the native look-up tables (LUTs) of the FPGA device (examples of such operations are multiply-and-accumulate and batch normalization). At about the same level of classification accuracy, compared to Xilinx's LogicNets, our design achieves 2.36$\times$ lower latency and 24.42$\times$ lower LUT utilization.
△ Less
Submitted 6 April, 2021;
originally announced April 2021.
-
SynergicLearning: Neural Network-Based Feature Extraction for Highly-Accurate Hyperdimensional Learning
Authors:
Mahdi Nazemi,
Amirhossein Esmaili,
Arash Fayyazi,
Massoud Pedram
Abstract:
Machine learning models differ in terms of accuracy, computational/memory complexity, training time, and adaptability among other characteristics. For example, neural networks (NNs) are well-known for their high accuracy due to the quality of their automatic feature extraction while brain-inspired hyperdimensional (HD) learning models are famous for their quick training, computational efficiency,…
▽ More
Machine learning models differ in terms of accuracy, computational/memory complexity, training time, and adaptability among other characteristics. For example, neural networks (NNs) are well-known for their high accuracy due to the quality of their automatic feature extraction while brain-inspired hyperdimensional (HD) learning models are famous for their quick training, computational efficiency, and adaptability. This work presents a hybrid, synergic machine learning model that excels at all the said characteristics and is suitable for incremental, on-line learning on a chip. The proposed model comprises an NN and a classifier. The NN acts as a feature extractor and is specifically trained to work well with the classifier that employs the HD computing framework. This work also presents a parameterized hardware implementation of the said feature extraction and classification components while introducing a compiler that maps any arbitrary NN and/or classifier to the aforementioned hardware. The proposed hybrid machine learning model has the same level of accuracy (i.e. $\pm$1%) as NNs while achieving at least 10% improvement in accuracy compared to HD learning models. Additionally, the end-to-end hardware realization of the hybrid model improves power efficiency by 1.60x compared to state-of-the-art, high-performance HD learning implementations while improving latency by 2.13x. These results have profound implications for the application of such synergic models in challenging cognitive tasks.
△ Less
Submitted 4 August, 2020; v1 submitted 30 July, 2020;
originally announced July 2020.
-
HIPE-MAGIC: A Technology-Aware Synthesis and Mapping Flow for HIghly Parallel Execution of Memristor-Aided LoGIC
Authors:
Arash Fayyazi,
Amirhossein Esmaili,
Massoud Pedram
Abstract:
Recent efforts for finding novel computing paradigms that meet today's design requirements have given rise to a new trend of processing-in-memory relying on non-volatile memories. In this paper, we present HIPE-MAGIC, a technology-aware synthesis and mapping flow for highly parallel execution of the memristor-based logic. Our framework is built upon two fundamental contributions: balancing techniq…
▽ More
Recent efforts for finding novel computing paradigms that meet today's design requirements have given rise to a new trend of processing-in-memory relying on non-volatile memories. In this paper, we present HIPE-MAGIC, a technology-aware synthesis and mapping flow for highly parallel execution of the memristor-based logic. Our framework is built upon two fundamental contributions: balancing techniques during the logic synthesis, mainly targeting benefits of the parallelism offered by memristive crossbar arrays (MCAs), and an efficient technology mapping framework to maximize the performance and area-efficiency of the memristor-based logic. Our experimental evaluations across several benchmark suites demonstrate the superior performance of HIPE-MAGIC in terms of throughput and energy efficiency compared to recently developed synthesis and mapping flows targeting MCAs, as well as the conventional CPU computing.
△ Less
Submitted 5 June, 2020;
originally announced June 2020.
-
Logic Verification of Ultra-Deep Pipelined Beyond-CMOS Technologies
Authors:
Arash Fayyazi,
Shahin Nazarian,
Massoud Pedram
Abstract:
Traditional logical equivalence checking (LEC) which plays a major role in entire chip design process faces challenges of meeting the requirements demanded by the many emerging technologies that are based on logic models different from standard complementary metal oxide semiconductor (CMOS). In this paper, we propose a LEC framework to be employed in the verification process of beyond-CMOS circuit…
▽ More
Traditional logical equivalence checking (LEC) which plays a major role in entire chip design process faces challenges of meeting the requirements demanded by the many emerging technologies that are based on logic models different from standard complementary metal oxide semiconductor (CMOS). In this paper, we propose a LEC framework to be employed in the verification process of beyond-CMOS circuits. Our LEC framework is compatible with existing CMOS technologies, but, also able to check features and capabilities that are unique to beyond-CMOS technologies. For instance, the performance of some emerging technologies benefits from ultra-deep pipelining and verification of such circuits requires new models and algorithms. We, therefore, present the Multi-Cycle Input Dependency (MCID) circuit model which is a novel model representation of design to explicitly capture the dependency of primary outputs of the circuit on sequences of internal signals and inputs. Embedding the proposed circuit model and several structural checking modules, the process of verification can be independent of the underlying technology and signaling. We benchmark the proposed framework on post-synthesis rapid single-flux-quantum (RSFQ) netlists. Results show a comparative verification time of RSFQ circuit benchmark including 32-bit Kogge-Stone adder, 16-bit integer divider, and ISCAS'85 circuits with respect to ABC tool for similar CMOS circuits.
△ Less
Submitted 27 May, 2020;
originally announced May 2020.
-
VeriSFQ - A Semi-formal Verification Framework and Benchmark for Single Flux Quantum Technology
Authors:
Alvin D. Wong,
Kevin Su,
Hang Sun,
Arash Fayyazi,
Massoud Pedram,
Shahin Nazarian
Abstract:
In this paper, we propose a semi-formal verification framework for single-flux quantum (SFQ) circuits called VeriSFQ, using the Universal Verification Methodology (UVM) standard. The considered SFQ technology is superconducting digital electronic devices that operate at cryogenic temperatures with active circuit elements called the Josephson junction, which operate at high switching speeds and low…
▽ More
In this paper, we propose a semi-formal verification framework for single-flux quantum (SFQ) circuits called VeriSFQ, using the Universal Verification Methodology (UVM) standard. The considered SFQ technology is superconducting digital electronic devices that operate at cryogenic temperatures with active circuit elements called the Josephson junction, which operate at high switching speeds and low switching energy - allowing SFQ circuits to operate at frequencies over 300 gigahertz. Due to key differences between SFQ and CMOS logic, verification techniques for the former are not as advanced as the latter. Thus, it is crucial to develop efficient verification techniques as the complexity of SFQ circuits scales. The VeriSFQ framework focuses on verifying the key circuit and gate-level properties of SFQ logic: fanout, gate-level pipeline, path balancing, and input-to-output latency. The combinational circuits considered in analyzing the performance of VeriSFQ are: Kogge-Stone adders (KSA), array multipliers, integer dividers, and select ISCAS'85 combinational benchmark circuits. Methods of introducing bugs into SFQ circuit designs for verification detection were experimented with - including stuck-at faults, fanout errors, unbalanced paths, and functional bugs like incorrect logic gates. In addition, we propose an SFQ verification benchmark consisting of combinational SFQ circuits that exemplify SFQ logic properties and present the performance of the VeriSFQ framework on these benchmark circuits. The portability and reusability of the UVM standard allows the VeriSFQ framework to serve as a foundation for future SFQ semi-formal verification techniques.
△ Less
Submitted 17 March, 2019;
originally announced March 2019.
-
SpRRAM: A Predefined Sparsity Based Memristive Neuromorphic Circuit for Low Power Application
Authors:
Arash Fayyazi,
Souvik Kundu,
Shahin Nazarian,
Peter A. Beerel,
Massoud Pedram
Abstract:
In this paper, we propose an efficient predefined structured sparsity-based ex-situ training framework for a hybrid CMOS-memristive neuromorphic hardware for deep neural network to significantly lower the power consumption and computational complexity and improve scalability. The structure is verified on a wide range of datasets including MNIST handwritten recognition, breast cancer prediction, an…
▽ More
In this paper, we propose an efficient predefined structured sparsity-based ex-situ training framework for a hybrid CMOS-memristive neuromorphic hardware for deep neural network to significantly lower the power consumption and computational complexity and improve scalability. The structure is verified on a wide range of datasets including MNIST handwritten recognition, breast cancer prediction, and mobile health monitoring. The results of this study show that compared to its fully connected version, the proposed structure provides significant power reduction while maintaining high classification accuracy.
△ Less
Submitted 10 September, 2018;
originally announced September 2018.