-
SAF: Scalable Acceleration Framework for dynamic and flexible scaling of FPGAs
Authors:
Masudul Hassan Quraishi,
Michael Riera,
Fengbo Ren,
Aman Arora,
Aviral Shrivastava
Abstract:
FPGAs are increasingly gaining traction in cloud and edge computing environments due to their hardware flexibility, low latency, and low energy consumption. However, the existing hardware stack of FPGA and the host-FPGA connectivity does not allow flexible scaling and simultaneous reconfiguration of multiple devices, which limits the adoption of FPGA at scale. In this paper, we present SAF -- an Ethernet-based scalable acceleration framework that allows FPGA to be hot-plugged into a network in a stand-alone fashion without connecting to a local host CPU, which enables flexible scalability. SAF provides a custom FPGA shell and a set of Ethernet protocols that allow FPGAs to connect with a remote host to accelerate application kernels. SAF can configure multiple FPGAs simultaneously, which significantly reduces the reconfiguration time in scaling effort. We implemented the SAF framework using Intel FPGA SDK for OpenCL and 20 Bittware 385A cards with Arria-10 FPGAs. We analyze a case study and conduct experiments to compare SAF with state-of-the-art multi-FPGA clusters. Results show that SAF provides 13X faster reconfiguration than sequential PCIe programming, reduces the hardware setup costs by 38%, application runtime by 25%, and energy consumption by 27%. We evaluated the performance scalability of SAF using the PTRANS benchmark of the HPCC FPGA benchmark suite and showed an almost linear speedup for strong and weak scaling scenarios.
Submitted 2 March, 2025;
originally announced March 2025.
-
Towards Efficient LUT-based PIM: A Scalable and Low-Power Approach for Modern Workloads
Authors:
Bahareh Khabbazan,
Marc Riera,
Antonio González
Abstract:
Data movement in memory-intensive workloads, such as deep learning, incurs energy costs that are over three orders of magnitude higher than the cost of computation. Since these workloads involve frequent data transfers between memory and processing units, addressing data movement overheads is crucial for improving performance. Processing-using-memory (PuM) offers an effective solution by enabling in-memory computation, thereby minimizing data transfers. In this paper we propose Lama, a LUT-based PuM architecture designed to efficiently execute SIMD operations by supporting independent column accesses within each mat of a DRAM subarray. Lama exploits DRAM's mat-level parallelism and open-page policy to significantly reduce the number of energy-intensive memory activation (ACT) commands, which are the primary source of overhead in most PuM architectures. Unlike prior PuM solutions, Lama supports up to 8-bit operand precision without decomposing computations, while incurring only a 2.47% area overhead. Our evaluation shows Lama achieves an average performance improvement of 8.5x over state-of-the-art PuM architectures and a 3.8x improvement over CPU, along with energy efficiency gains of 6.9x/8x, respectively, for bulk 8-bit multiplication.
We also introduce LamaAccel, an HBM-based PuM accelerator that utilizes Lama to accelerate the inference of attention-based models. LamaAccel employs exponential quantization to optimize product/accumulation in dot-product operations, transforming them into simpler tasks like addition and counting. LamaAccel delivers up to 9.3x/19.2x reduction in energy and 4.8x/9.8x speedup over TPU/GPU, along with up to 5.8x energy reduction and 2.1x speedup over a state-of-the-art PuM baseline.
Submitted 4 February, 2025;
originally announced February 2025.
-
Hamun: An Approximate Computation Method to Prolong the Lifespan of ReRAM-Based Accelerators
Authors:
Mohammad Sabri,
Marc Riera,
Antonio Gonzalez
Abstract:
ReRAM-based accelerators exhibit enormous potential to increase computational efficiency for DNN inference tasks, delivering significant performance and energy savings over traditional platforms. By incorporating adaptive scheduling, these accelerators dynamically adjust to DNN requirements, optimizing allocation of constrained hardware resources. However, ReRAM cells have limited endurance cycles due to wear-out from multiple updates for each inference execution, which shortens the lifespan of ReRAM-based accelerators and presents a practical challenge in positioning them as alternatives to conventional platforms like TPUs. Addressing these endurance limitations is essential for making ReRAM-based solutions viable for long-term, high-performance DNN inference. To this end, we introduce Hamun, an approximate computing method designed to extend the lifespan of ReRAM-based accelerators through a range of optimizations. Hamun incorporates a novel mechanism that detects cells that have become faulty due to wear-out and retires them, thereby avoiding their adverse impact on DNN accuracy. Moreover, Hamun extends the lifespan of ReRAM-based accelerators by adapting wear-leveling techniques across various abstraction levels of the accelerator and implementing a batch execution scheme to maximize ReRAM cell usage for multiple inferences. On average, evaluated on a set of popular DNNs, Hamun demonstrates an improvement in lifespan of 13.2x over a state-of-the-art baseline. The main contributors to this improvement are the fault handling and batch execution schemes, which provide 4.6x and 2.6x lifespan improvements respectively.
Submitted 4 February, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
ARAS: An Adaptive Low-Cost ReRAM-Based Accelerator for DNNs
Authors:
Mohammad Sabri,
Marc Riera,
Antonio González
Abstract:
Processing Using Memory (PUM) accelerators have the potential to perform Deep Neural Network (DNN) inference by using arrays of memory cells as computation engines. Among various memory technologies, ReRAM crossbars show promising performance in computing dot-product operations in the analog domain. Nevertheless, the expensive writing procedure of ReRAM cells has led researchers to design accelerators whose crossbars have enough capacity to store the full DNN. Given the tremendous and continuous increase in DNN model sizes, this approach is unfeasible for some networks, or inefficient due to the huge hardware requirements. Those accelerators lack the flexibility to adapt to any given DNN model, facing a challenge.
To address this issue we introduce ARAS, a cost-effective ReRAM-based accelerator that employs a smart scheduler to adapt different DNNs to the resource-limited hardware. ARAS also overlaps the computation of a layer with the weight writing of several layers to mitigate the high writing latency of ReRAM. Furthermore, ARAS introduces three optimizations aimed at reducing the energy overheads of writing in ReRAM. Our key optimization capitalizes on the observation that DNN weights can be re-encoded to augment their similarity between layers, increasing the amount of bitwise values that are equal or similar when overwriting ReRAM cells and, hence, reducing the amount of energy required to update the cells. Overall, ARAS greatly reduces the ReRAM writing activity. We evaluate ARAS on a popular set of DNNs. ARAS provides up to 2.2x speedup and 45% energy savings over a baseline PUM accelerator without any optimization. Compared to a TPU-like accelerator, ARAS provides up to 1.5x speedup and 61% energy savings.
Submitted 23 October, 2024;
originally announced October 2024.
-
An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses
Authors:
Bahareh Khabbazan,
Marc Riera,
Antonio González
Abstract:
The constant growth of DNNs makes them challenging to implement and run efficiently on traditional compute-centric architectures. Some accelerators have attempted to add more compute units and on-chip buffers to solve the memory wall problem without much success, and sometimes even worsening the issue since more compute units also require higher memory bandwidth. Prior works have proposed the design of memory-centric architectures based on the Near-Data Processing (NDP) paradigm. NDP seeks to break the memory wall by moving the computations closer to the memory hierarchy, reducing the data movements and their cost as much as possible. The 3D-stacked memory is especially appealing for DNN accelerators due to its high-density/low-energy storage and near-memory computation capabilities to perform the DNN operations massively in parallel. However, memory accesses remain as the main bottleneck for running modern DNNs efficiently.
To improve the efficiency of DNN inference we present QeiHaN, a hardware accelerator that implements a 3D-stacked memory-centric weight storage scheme to take advantage of a logarithmic quantization of activations. In particular, since activations of FC and CONV layers of modern DNNs are commonly represented as powers of two with negative exponents, QeiHaN performs an implicit in-memory bit-shifting of the DNN weights to reduce memory activity. Only the meaningful bits of the weights required for the bit-shift operation are accessed. Overall, QeiHaN reduces memory accesses by 25% compared to a standard memory organization. We evaluate QeiHaN on a popular set of DNNs. On average, QeiHaN provides 4.3x speedup and 3.5x energy savings over a Neurocube-like accelerator.
Submitted 27 October, 2023;
originally announced October 2023.
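The QeiHaN abstract above relies on the fact that multiplying a weight by an activation of the form 2^(-k) is a right shift by k. The following is a minimal sketch of that arithmetic only, not of the accelerator's memory organization. The function names, the `max_shift` clamp, and the `frac_bits` fixed-point format are assumptions made for illustration and do not come from the paper.

```python
import math

def log2_quantize(a, max_shift=7):
    """Map an activation in (0, 1] to the shift k such that a ~ 2**(-k).

    Returns None for non-positive activations (treated as zero here).
    The clamp to [0, max_shift] is an illustrative assumption.
    """
    if a <= 0:
        return None
    k = round(-math.log2(a))
    return min(max(k, 0), max_shift)

def shift_dot(weights, shifts, frac_bits=8):
    """Dot product of integer weights with power-of-two activations.

    Each product w * 2**(-k) is computed as a right shift of the weight,
    held in a fixed-point accumulator with frac_bits fractional bits.
    The result is exact as long as every k <= frac_bits.
    """
    acc = 0
    for w, k in zip(weights, shifts):
        if k is None:          # zero activation contributes nothing
            continue
        acc += (w << frac_bits) >> k
    return acc / (1 << frac_bits)

weights = [3, -5, 7]
activations = [0.5, 0.25, 1.0]
shifts = [log2_quantize(a) for a in activations]
result = shift_dot(weights, shifts)
```

With these inputs the shifts are 1, 2 and 0, and the shift-based result matches the ordinary dot product 3*0.5 - 5*0.25 + 7*1.0.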
-
DNA-TEQ: An Adaptive Exponential Quantization of Tensors for DNN Inference
Authors:
Bahareh Khabbazan,
Marc Riera,
Antonio González
Abstract:
Quantization is commonly used in Deep Neural Networks (DNNs) to reduce the storage and computational complexity by decreasing the arithmetical precision of activations and weights, a.k.a. tensors. Efficient hardware architectures employ linear quantization to enable the deployment of recent DNNs onto embedded systems and mobile devices. However, linear uniform quantization cannot usually reduce the numerical precision to less than 8 bits without sacrificing high performance in terms of model accuracy. The performance loss is due to the fact that tensors do not follow uniform distributions. In this paper, we show that a significant amount of tensors fit into an exponential distribution. Then, we propose DNA-TEQ to exponentially quantize DNN tensors with an adaptive scheme that achieves the best trade-off between numerical precision and accuracy loss. The experimental results show that DNA-TEQ provides a much lower quantization bit-width compared to previous proposals, resulting in an average compression ratio of 40% over the linear INT8 baseline, with negligible accuracy loss and without retraining the DNNs. Besides, DNA-TEQ leads the way in performing dot-product operations in the exponential domain, which saves 66% of energy consumption on average for a set of widely used DNNs.
Submitted 22 November, 2023; v1 submitted 28 June, 2023;
originally announced June 2023.
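The DNA-TEQ abstract above mentions dot products performed in the exponential domain. One way to picture this, as a hedged sketch: if each value is approximated as sign * base^e with an integer exponent e, a product becomes an addition of exponents, and accumulation becomes counting how often each exponent sum occurs. The base, the exponent range, and the absence of any scale or offset term are assumptions for illustration; the actual DNA-TEQ parameters and adaptive search are not reproduced here.

```python
import math
from collections import Counter

def exp_quantize(x, base=2.0, e_min=-8, e_max=7):
    """Approximate x as sign * base**e and return (sign, e).

    Zero maps to (0, 0). The clamp range is an illustrative assumption.
    """
    if x == 0:
        return (0, 0)
    sign = 1 if x > 0 else -1
    e = round(math.log(abs(x), base))
    return (sign, min(max(e, e_min), e_max))

def exp_dot(q_acts, q_weights, base=2.0):
    """Dot product of two exponentially quantized vectors.

    Products are formed by adding exponents; a signed counter per
    exponent sum replaces the multiply-accumulate loop. Only the final
    reduction touches base**e, once per distinct exponent sum.
    """
    counts = Counter()
    for (sa, ea), (sw, ew) in zip(q_acts, q_weights):
        sign = sa * sw
        if sign:
            counts[ea + ew] += sign
    return sum(c * base ** e for e, c in counts.items())

acts = [1.0, 2.0, 4.0, -0.5]
wts = [2.0, 2.0, 0.25, 4.0]
q_a = [exp_quantize(v) for v in acts]
q_w = [exp_quantize(v) for v in wts]
approx = exp_dot(q_a, q_w)
```

The inputs here are exact powers of two, so the quantized result equals the true dot product; for general inputs the result is an approximation whose error depends on the chosen base.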
-
ReDy: A Novel ReRAM-centric Dynamic Quantization Approach for Energy-efficient CNN Inference
Authors:
Mohammad Sabri,
Marc Riera,
Antonio González
Abstract:
The primary operation in DNNs is the dot product of quantized input activations and weights. Prior works have proposed the design of memory-centric architectures based on the Processing-In-Memory (PIM) paradigm. Resistive RAM (ReRAM) technology is especially appealing for PIM-based DNN accelerators due to its high density to store weights, low leakage energy, low read latency, and high performance capabilities to perform the DNN dot-products massively in parallel within the ReRAM crossbars. However, the main bottleneck of these architectures is the energy-hungry analog-to-digital conversions (ADCs) required to perform analog computations in-ReRAM, which penalizes the efficiency and performance benefits of PIM. To improve energy-efficiency of in-ReRAM analog dot-product computations we present ReDy, a hardware accelerator that implements a ReRAM-centric Dynamic quantization scheme to take advantage of the bit serial streaming and processing of activations. The energy consumption of ReRAM-based DNN accelerators is directly proportional to the numerical precision of the input activations of each DNN layer. In particular, ReDy exploits the fact that activations of CONV layers from Convolutional Neural Networks (CNNs), a subset of DNNs, are commonly grouped according to the size of their filters and the size of the ReRAM crossbars. Then, ReDy quantizes on-the-fly each group of activations with a different numerical precision based on a novel heuristic that takes into account the statistical distribution of each group. Overall, ReDy greatly reduces the activity of the ReRAM crossbars and the number of A/D conversions compared to a static 8-bit uniform quantization. We evaluate ReDy on a popular set of modern CNNs. On average, ReDy provides 13% energy savings over an ISAAC-like accelerator with negligible accuracy loss and area overhead.
Submitted 28 June, 2023;
originally announced June 2023.
-
A Survey of Near-Data Processing Architectures for Neural Networks
Authors:
Mehdi Hassanpour,
Marc Riera,
Antonio González
Abstract:
Data-intensive workloads and applications, such as machine learning (ML), are fundamentally limited by traditional computing systems based on the von-Neumann architecture. As data movement operations and energy consumption become key bottlenecks in the design of computing systems, the interest in unconventional approaches such as Near-Data Processing (NDP), machine learning, and especially neural network (NN)-based accelerators has grown significantly. Emerging memory technologies, such as ReRAM and 3D-stacked memory, are promising for efficiently architecting NDP-based accelerators for NN due to their ability to serve as both high-density/low-energy storage and in/near-memory computation/search engines. In this paper, we present a survey of techniques for designing NDP architectures for NN. By classifying the techniques based on the memory technology employed, we underscore their similarities and differences. Finally, we discuss open challenges and future perspectives that need to be explored in order to improve and extend the adoption of NDP architectures for future computing platforms. This paper will be valuable for computer architects, chip designers and researchers in the area of machine learning.
Submitted 23 December, 2021;
originally announced December 2021.
-
FSpGEMM: An OpenCL-based HPC Framework for Accelerating General Sparse Matrix-Matrix Multiplication on FPGAs
Authors:
Erfan Bank Tavakoli,
Michael Riera,
Masudul Hassan Quraishi,
Fengbo Ren
Abstract:
General sparse matrix-matrix multiplication (SpGEMM) is an integral part of many scientific computing, high-performance computing (HPC), and graph analytic applications. This paper presents a new compressed sparse vector (CSV) format for representing sparse matrices and FSpGEMM, an OpenCL-based HPC framework for accelerating general sparse matrix-matrix multiplication on FPGAs. The proposed FSpGEMM framework includes an FPGA kernel implementing a throughput-optimized hardware architecture based on Gustavson's algorithm and a host program implementing pre-processing functions for converting input matrices to the CSV format tailored for the proposed architecture. FSpGEMM utilizes a new buffering scheme tailored to Gustavson's algorithm. We compare FSpGEMM implemented on an Intel Arria 10 GX FPGA development board with Intel Math Kernel Library (MKL) implemented on an Intel Xeon E5-2637 CPU and cuSPARSE on an NVIDIA GTX TITAN X GPU, respectively, for multiplying a set of sparse matrices selected from SuiteSparse Matrix Collection. The experiment results show that the proposed FSpGEMM solution achieves on average 4.9x and 1.7x higher performance with 31.9x and 13.1x lower energy consumption per SpGEMM computation than the CPU and GPU implementations, respectively.
Submitted 18 December, 2021;
originally announced December 2021.
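The FSpGEMM abstract above builds on Gustavson's algorithm, which forms each row of C = A * B by scaling and merging the rows of B selected by the nonzeros of the corresponding row of A. The following is a minimal software sketch of that row-wise scheme. It uses a plain list-of-(column, value) row representation chosen for illustration; it is not the paper's CSV format, its buffering scheme, or its FPGA kernel.

```python
def spgemm_gustavson(a_rows, b_rows):
    """Row-wise sparse matrix product C = A * B.

    a_rows[i] and b_rows[k] are lists of (column, value) pairs holding
    the nonzeros of row i of A and row k of B. Returns the rows of C in
    the same representation, sorted by column.
    """
    c_rows = []
    for a_row in a_rows:
        acc = {}                       # sparse accumulator for one row of C
        for k, a_val in a_row:         # nonzero A[i, k] selects row k of B
            for j, b_val in b_rows[k]:
                acc[j] = acc.get(j, 0) + a_val * b_val
        c_rows.append(sorted(acc.items()))
    return c_rows

# A = [[1, 0, 2],      B = [[0, 4],
#      [0, 3, 0]]           [5, 0],
#                           [6, 0]]
a_rows = [[(0, 1), (2, 2)], [(1, 3)]]
b_rows = [[(1, 4)], [(0, 5)], [(0, 6)]]
c_rows = spgemm_gustavson(a_rows, b_rows)
```

Only nonzero entries are ever touched, so the work scales with the number of nonzero partial products rather than with the dense matrix dimensions.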
-
CREW: Computation Reuse and Efficient Weight Storage for Hardware-accelerated MLPs and RNNs
Authors:
Marc Riera,
Jose-Maria Arnau,
Antonio Gonzalez
Abstract:
Deep Neural Networks (DNNs) have achieved tremendous success for cognitive applications. The core operation in a DNN is the dot product between quantized inputs and weights. Prior works exploit the weight/input repetition that arises due to quantization to avoid redundant computations in Convolutional Neural Networks (CNNs). However, in this paper we show that their effectiveness is severely limited when applied to Fully-Connected (FC) layers, which are commonly used in state-of-the-art DNNs, as is the case for modern Recurrent Neural Networks (RNNs) and Transformer models.
To improve energy-efficiency of FC computation we present CREW, a hardware accelerator that implements Computation Reuse and an Efficient Weight Storage mechanism to exploit the large number of repeated weights in FC layers. CREW first performs the multiplications of the unique weights by their respective inputs and stores the results in an on-chip buffer. The storage requirements are modest due to the small number of unique weights and the relatively small size of the input compared to convolutional layers. Next, CREW computes each output by fetching and adding its required products. To this end, each weight is replaced offline by an index in the buffer of unique products. Indices are typically smaller than the quantized weights, since the number of unique weights for each input tends to be much lower than the range of quantized weights, which reduces storage and memory bandwidth requirements.
Overall, CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage. We evaluate CREW on a diverse set of modern DNNs. On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator. Compared to UCNN, a state-of-the-art computation reuse technique, CREW achieves 2.10x speedup and 2.08x energy savings on average.
Submitted 11 March, 2022; v1 submitted 20 July, 2021;
originally announced July 2021.
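The two-phase scheme described in the CREW abstract above (multiply each input by its unique weights once, then build every output by fetching and adding stored products through per-weight indices) can be sketched in software as follows. This is an illustrative NumPy model under assumed shapes and value ranges, not the accelerator's datapath; the function names and the 4-bit weight range are hypothetical.

```python
import numpy as np

def build_index_tables(w):
    """Offline step: per input row, find the unique weights and replace
    every weight by its index into that row's list of unique values."""
    uniques, indices = [], []
    for row in w:
        u, idx = np.unique(row, return_inverse=True)
        uniques.append(u)
        indices.append(idx)
    return uniques, indices

def fc_with_reuse(x, uniques, indices, n_out):
    """FC layer output using one multiplication per unique weight.

    Returns the output vector and the number of multiplications done.
    """
    out = np.zeros(n_out, dtype=np.int64)
    mults = 0
    for xi, u, idx in zip(x, uniques, indices):
        products = xi * u        # buffer of unique products for this input
        mults += len(u)
        out += products[idx]     # fetch by index and accumulate
    return out, mults

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=64)        # quantized inputs (assumed 8-bit)
w = rng.integers(-8, 8, size=(64, 256))     # quantized weights (assumed 4-bit)
uniques, indices = build_index_tables(w)
out, mults = fc_with_reuse(x, uniques, indices, n_out=256)
```

In this toy setting each input has at most 16 distinct weights, so at most 64 * 16 multiplications are needed instead of 64 * 256, and each stored index needs fewer bits than the weight range would suggest whenever a row uses only some of the possible values.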
-
FLASH 1.0: A Software Framework for Rapid Parallel Deployment and Enhancing Host Code Portability in Heterogeneous Computing
Authors:
Michael Riera,
Masudul Hassan Quraishi,
Erfan Bank Tavakoli,
Fengbo Ren
Abstract:
This paper presents FLASH 1.0, a C++-based software framework for rapid parallel deployment and enhancing host code portability in heterogeneous computing. FLASH takes a novel approach in describing kernels and dynamically dispatching them in a hardware-agnostic manner. FLASH features truly hardware-agnostic frontend interfaces, which unify the compile-time control flow and enforce a portability-optimized code organization that imposes a demarcation between computational (performance-critical) and functional (non-performance-critical) codes as well as the separation of hardware-specific and hardware-agnostic codes in the host application. We use static code analysis to measure the hardware independence ratio of twelve popular HPC applications and show that up to 99.72% code portability can be achieved with FLASH. Similarly, we measure and compare the complexity of state-of-the-art portable programming models to show that FLASH can achieve a code reduction of up to 4.0x for two common HPC kernels while maintaining 100% code portability with a normalized framework overhead between 1% and 13% of the total kernel runtime. The codes are available at https://github.com/PSCLab-ASU/FLASH.
Submitted 5 July, 2023; v1 submitted 25 June, 2021;
originally announced June 2021.
-
HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC
Authors:
Michael Riera,
Erfan Bank Tavakoli,
Masudul Hassan Quraishi,
Fengbo Ren
Abstract:
This paper presents HALO 1.0, an open-ended extensible multi-agent software framework that implements a set of proposed hardware-agnostic accelerator orchestration (HALO) principles. HALO implements a novel compute-centric message passing interface (C^2MPI) specification for enabling the performance portable execution of a hardware-agnostic host application across heterogeneous accelerators. The experimental results of evaluating eight widely used HPC subroutines based on Intel Xeon E5-2620 CPUs, Intel Arria 10 GX FPGAs, and NVIDIA GeForce RTX 2080 Ti GPUs show that HALO 1.0 allows for a unified control flow for host programs to run across all the computing devices with a consistently top performance portability score, which is up to five orders of magnitude higher than the OpenCL-based solution.
Submitted 6 July, 2022; v1 submitted 21 November, 2020;
originally announced November 2020.
-
A Survey on Future Railway Radio Communications Services: Challenges and Opportunities
Authors:
Juan Moreno Garcia-Loygorri,
Jose Manuel Riera,
Leandro de Haro,
Carlos Rodriguez
Abstract:
Radio communications is one of the most disruptive technologies in railways, enabling a huge set of value-added services that greatly improve many aspects of railways, making them more efficient, safer, and profitable. Lately, some major technologies like ERTMS for high-speed railways and CBTC for subways have made possible a reduction of headway and increased safety never before seen in this field. The railway industry is now looking at wireless communications with great interest, and this can be seen in many projects around the world. Thus, railway radio communications is again a flourishing field, with a lot of research and many things to be done. This survey article explains both opportunities and challenges to be addressed by the railway sector in order to obtain all the possible benefits of the latest radio technologies.
Submitted 5 October, 2020;
originally announced October 2020.
-
(Pen-) Ultimate DNN Pruning
Authors:
Marc Riera,
Jose-Maria Arnau,
Antonio Gonzalez
Abstract:
DNN pruning reduces memory footprint and computational work of DNN-based solutions to improve performance and energy-efficiency. An effective pruning scheme should be able to systematically remove connections and/or neurons that are unnecessary or redundant, reducing the DNN size without any loss in accuracy. In this paper we show that prior pruning schemes require an extremely time-consuming iterative process that requires retraining the DNN many times to tune the pruning hyperparameters. We propose a DNN pruning scheme based on Principal Component Analysis and relative importance of each neuron's connection that automatically finds the optimized DNN in one shot without requiring hand-tuning of multiple parameters.
Submitted 6 June, 2019;
originally announced June 2019.