-
Towards Efficient IMC Accelerator Design Through Joint Hardware-Workload Co-optimization
Authors:
Olga Krestinskaya,
Mohammed E. Fouda,
Ahmed Eltawil,
Khaled N. Salama
Abstract:
Designing generalized in-memory computing (IMC) hardware that efficiently supports a variety of workloads requires extensive design space exploration, which is infeasible to perform manually. Optimizing hardware individually for each workload or solely for the largest workload often fails to yield the most efficient generalized solutions. To address this, we propose a joint hardware-workload optim…
▽ More
Designing generalized in-memory computing (IMC) hardware that efficiently supports a variety of workloads requires extensive design space exploration, which is infeasible to perform manually. Optimizing hardware individually for each workload or solely for the largest workload often fails to yield the most efficient generalized solutions. To address this, we propose a joint hardware-workload optimization framework that identifies optimised IMC chip architecture parameters, enabling more efficient, workload-flexible hardware. We show that joint optimization achieves 36%, 36%, 20%, and 69% better energy-latency-area scores for VGG16, ResNet18, AlexNet, and MobileNetV3, respectively, compared to the separate architecture parameters search optimizing for a single largest workload. Additionally, we quantify the performance trade-offs and losses of the resulting generalized IMC hardware compared to workload-specific IMC designs.
△ Less
Submitted 1 February, 2025; v1 submitted 22 October, 2024;
originally announced October 2024.
-
Towards Efficient In-memory Computing Hardware for Quantized Neural Networks: State-of-the-art, Open Challenges and Perspectives
Authors:
Olga Krestinskaya,
Li Zhang,
Khaled Nabil Salama
Abstract:
The amount of data processed in the cloud, the development of Internet-of-Things (IoT) applications, and growing data privacy concerns force the transition from cloud-based to edge-based processing. Limited energy and computational resources on edge push the transition from traditional von Neumann architectures to In-memory Computing (IMC), especially for machine learning and neural network applic…
▽ More
The amount of data processed in the cloud, the development of Internet-of-Things (IoT) applications, and growing data privacy concerns force the transition from cloud-based to edge-based processing. Limited energy and computational resources on edge push the transition from traditional von Neumann architectures to In-memory Computing (IMC), especially for machine learning and neural network applications. Network compression techniques are applied to implement a neural network on limited hardware resources. Quantization is one of the most efficient network compression techniques allowing to reduce the memory footprint, latency, and energy consumption. This paper provides a comprehensive review of IMC-based Quantized Neural Networks (QNN) and links software-based quantization approaches to IMC hardware implementation. Moreover, open challenges, QNN design requirements, recommendations, and perspectives along with an IMC-based QNN hardware roadmap are provided.
△ Less
Submitted 8 July, 2023;
originally announced July 2023.
-
Towards Efficient RRAM-based Quantized Neural Networks Hardware: State-of-the-art and Open Issues
Authors:
O. Krestinskaya,
L. Zhang,
K. N. Salama
Abstract:
The increasing amount of data processed on edge and the demand for reducing the energy consumption for large neural network architectures have initiated the transition from traditional von Neumann architectures towards in-memory computing paradigms. Quantization is one of the methods to reduce power and computation requirements for neural networks by limiting bit precision. Resistive Random Access…
▽ More
The increasing amount of data processed on edge and the demand for reducing the energy consumption for large neural network architectures have initiated the transition from traditional von Neumann architectures towards in-memory computing paradigms. Quantization is one of the methods to reduce power and computation requirements for neural networks by limiting bit precision. Resistive Random Access Memory (RRAM) devices are great candidates for Quantized Neural Networks (QNN) implementations. As the number of possible conductive states in RRAMs is limited, a certain level of quantization is always considered when designing RRAM-based neural networks. In this work, we provide a comprehensive analysis of state-of-the-art RRAM-based QNN implementations, showing where RRAMs stand in terms of satisfying the criteria of efficient QNN hardware. We cover hardware and device challenges related to QNNs and show the main unsolved issues and possible future research directions.
△ Less
Submitted 25 September, 2022;
originally announced September 2022.
-
BackLink: Supervised Local Training with Backward Links
Authors:
Wenzhe Guo,
Mohammed E Fouda,
Ahmed M. Eltawil,
Khaled N. Salama
Abstract:
Empowered by the backpropagation (BP) algorithm, deep neural networks have dominated the race in solving various cognitive tasks. The restricted training pattern in the standard BP requires end-to-end error propagation, causing large memory cost and prohibiting model parallelization. Existing local training methods aim to resolve the training obstacle by completely cutting off the backward path be…
▽ More
Empowered by the backpropagation (BP) algorithm, deep neural networks have dominated the race in solving various cognitive tasks. The restricted training pattern in the standard BP requires end-to-end error propagation, causing large memory cost and prohibiting model parallelization. Existing local training methods aim to resolve the training obstacle by completely cutting off the backward path between modules and isolating their gradients to reduce memory cost and accelerate the training process. These methods prevent errors from flowing between modules and hence information exchange, resulting in inferior performance. This work proposes a novel local training algorithm, BackLink, which introduces inter-module backward dependency and allows errors to flow between modules. The algorithm facilitates information to flow backward along with the network. To preserve the computational advantage of local training, BackLink restricts the error propagation length within the module. Extensive experiments performed in various deep convolutional neural networks demonstrate that our method consistently improves the classification performance of local training algorithms over other methods. For example, in ResNet32 with 16 local modules, our method surpasses the conventional greedy local training method by 4.00\% and a recent work by 1.83\% in accuracy on CIFAR10, respectively. Analysis of computational costs reveals that small overheads are incurred in GPU memory costs and runtime on multiple GPUs. Our method can lead up to a 79\% reduction in memory cost and 52\% in simulation runtime in ResNet110 compared to the standard BP. Therefore, our method could create new opportunities for improving training algorithms towards better efficiency and biological plausibility.
△ Less
Submitted 14 May, 2022;
originally announced May 2022.
-
Efficient Training of Spiking Neural Networks with Temporally-Truncated Local Backpropagation through Time
Authors:
Wenzhe Guo,
Mohammed E. Fouda,
Ahmed M. Eltawil,
Khaled Nabil Salama
Abstract:
Directly training spiking neural networks (SNNs) has remained challenging due to complex neural dynamics and intrinsic non-differentiability in firing functions. The well-known backpropagation through time (BPTT) algorithm proposed to train SNNs suffers from large memory footprint and prohibits backward and update unlocking, making it impossible to exploit the potential of locally-supervised train…
▽ More
Directly training spiking neural networks (SNNs) has remained challenging due to complex neural dynamics and intrinsic non-differentiability in firing functions. The well-known backpropagation through time (BPTT) algorithm proposed to train SNNs suffers from large memory footprint and prohibits backward and update unlocking, making it impossible to exploit the potential of locally-supervised training methods. This work proposes an efficient and direct training algorithm for SNNs that integrates a locally-supervised training method with a temporally-truncated BPTT algorithm. The proposed algorithm explores both temporal and spatial locality in BPTT and contributes to significant reduction in computational cost including GPU memory utilization, main memory access and arithmetic operations. We thoroughly explore the design space concerning temporal truncation length and local training block size and benchmark their impact on classification accuracy of different networks running different types of tasks. The results reveal that temporal truncation has a negative effect on the accuracy of classifying frame-based datasets, but leads to improvement in accuracy on dynamic-vision-sensor (DVS) recorded datasets. In spite of resulting information loss, local training is capable of alleviating overfitting. The combined effect of temporal truncation and local training can lead to the slowdown of accuracy drop and even improvement in accuracy. In addition, training deep SNNs models such as AlexNet classifying CIFAR10-DVS dataset leads to 7.26% increase in accuracy, 89.94% reduction in GPU memory, 10.79% reduction in memory access, and 99.64% reduction in MAC operations compared to the standard end-to-end BPTT.
△ Less
Submitted 13 December, 2021;
originally announced January 2022.
-
Learning in Memristive Neural Network Architectures using Analog Backpropagation Circuits
Authors:
Olga Krestinskaya,
Khaled Nabil Salama,
Alex Pappachen James
Abstract:
The on-chip implementation of learning algorithms would speed-up the training of neural networks in crossbar arrays. The circuit level design and implementation of backpropagation algorithm using gradient descent operation for neural network architectures is an open problem. In this paper, we proposed the analog backpropagation learning circuits for various memristive learning architectures, such…
▽ More
The on-chip implementation of learning algorithms would speed-up the training of neural networks in crossbar arrays. The circuit level design and implementation of backpropagation algorithm using gradient descent operation for neural network architectures is an open problem. In this paper, we proposed the analog backpropagation learning circuits for various memristive learning architectures, such as Deep Neural Network (DNN), Binary Neural Network (BNN), Multiple Neural Network (MNN), Hierarchical Temporal Memory (HTM) and Long-Short Term Memory (LSTM). The circuit design and verification is done using TSMC 180nm CMOS process models, and TiO2 based memristor models. The application level validations of the system are done using XOR problem, MNIST character and Yale face image databases
△ Less
Submitted 31 August, 2018;
originally announced August 2018.
-
Memristor-based mono-stable oscillator
Authors:
A. T. Bahgat,
K. N. Salama
Abstract:
In this letter, a reactance-less mono-stable oscillator is introduced for the first time using memristors. By replacing bulky inductors and capacitors with memristors, the novel mono-stable oscillator can be an area-efficient solution for on-chip fully integrated systems. The proposed circuit is described, mathematically analysed and verified by circuit simulations.
In this letter, a reactance-less mono-stable oscillator is introduced for the first time using memristors. By replacing bulky inductors and capacitors with memristors, the novel mono-stable oscillator can be an area-efficient solution for on-chip fully integrated systems. The proposed circuit is described, mathematically analysed and verified by circuit simulations.
△ Less
Submitted 3 July, 2012;
originally announced July 2012.