-
Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference
Authors:
Deepika Bablani,
Jeffrey L. Mckinstry,
Steven K. Esser,
Rathinakumar Appuswamy,
Dharmendra S. Modha
Abstract:
For efficient neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. As each layer of a network may have different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance, two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL) is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS) is straightforward and relies on single epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50, ResNet-101 and BERT-base transformer networks, demonstrating enhanced performance across the entire accuracy-throughput frontier. The techniques demonstrate better performance than existing techniques in several commensurate comparisons. Notably, this is accomplished with significantly less computational time required to reach a solution.
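As a rough illustration of what "entropy of the weight distribution" can mean in practice, the sketch below computes the Shannon entropy of a layer's weights after uniform quantization and sorts layers by that score. It is not the authors' EAGL procedure: the symmetric quantizer, the function names `quantized_weight_entropy` and `rank_layers_by_entropy`, and the idea of ranking by ascending entropy are all assumptions made for this example.

```python
import math
from collections import Counter


def quantized_weight_entropy(weights, bits):
    """Shannon entropy (in bits) of a layer's weights after uniform quantization.

    Assumption: symmetric uniform quantizer scaled to the largest |weight|.
    """
    levels = max(2 ** (bits - 1) - 1, 1)           # largest positive integer code
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero step size
    step = max_abs / levels
    codes = [round(w / step) for w in weights]     # one integer code per weight
    total = len(codes)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(codes).values()
    )


def rank_layers_by_entropy(layers, bits):
    """Return layer names sorted from lowest to highest weight entropy.

    Hypothetical helper: how such a ranking maps to precision choices is
    not specified here.
    """
    scores = {name: quantized_weight_entropy(w, bits) for name, w in layers.items()}
    return sorted(scores, key=scores.get)
```

A layer whose quantized weights collapse onto a single code has entropy 0, while weights spread evenly over two codes have entropy 1 bit.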
Submitted 10 January, 2024; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Learned Step Size Quantization
Authors:
Steven K. Esser,
Jeffrey L. McKinstry,
Deepika Bablani,
Rathinakumar Appuswamy,
Dharmendra S. Modha
Abstract:
Deep networks run with low precision operations at inference time offer power and space advantages over high precision alternatives, but need to overcome the challenge of maintaining high accuracy as precision decreases. Here, we present a method for training such networks, Learned Step Size Quantization, that achieves the highest accuracy to date on the ImageNet dataset when using models, from a variety of architectures, with weights and activations quantized to 2-, 3- or 4-bits of precision, and that can train 3-bit models that reach full precision baseline accuracy. Our approach builds upon existing methods for learning weights in quantized networks by improving how the quantizer itself is configured. Specifically, we introduce a novel means to estimate and scale the task loss gradient at each weight and activation layer's quantizer step size, such that it can be learned in conjunction with other network parameters. This approach works using different levels of precision as needed for a given system and requires only a simple modification of existing training code.
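The idea of a quantizer whose step size receives a gradient can be sketched for a single scalar as follows. This is a minimal, framework-free illustration, not the authors' training code: the function names are hypothetical, the gradient is a straight-through estimate, and the scale factor 1 / sqrt(N * Qp) in `gradient_scale` is an assumption about the form of the scaling rather than something stated in the abstract.

```python
import math


def quant_range(bits, signed=True):
    """Return (Qn, Qp): the number of negative and positive integer levels."""
    if signed:
        return 2 ** (bits - 1), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def quantize(v, step, bits, signed=True):
    """Scale by the step size, clip to the integer range, round, rescale."""
    qn, qp = quant_range(bits, signed)
    clipped = min(max(v / step, -qn), qp)
    return round(clipped) * step  # note: Python rounds exact ties to even


def step_size_gradient(v, step, bits, signed=True):
    """Straight-through estimate of d(quantize)/d(step) for one value."""
    qn, qp = quant_range(bits, signed)
    x = v / step
    if x <= -qn:
        return -qn          # value clipped at the negative end
    if x >= qp:
        return qp           # value clipped at the positive end
    return round(x) - x     # inside the range: rounding error in step units


def gradient_scale(num_values, bits, signed=True):
    """Assumed scale factor 1 / sqrt(N * Qp) for the step size gradient."""
    _, qp = quant_range(bits, signed)
    return 1.0 / math.sqrt(num_values * qp)
```

In a real training loop the per-value gradients would be multiplied by the upstream loss gradient, summed over the layer, and scaled before the optimizer updates the step size alongside the weights.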
Submitted 6 May, 2020; v1 submitted 21 February, 2019;
originally announced February 2019.
-
Low Precision Policy Distillation with Application to Low-Power, Real-time Sensation-Cognition-Action Loop with Neuromorphic Computing
Authors:
Jeffrey L Mckinstry,
Davis R. Barch,
Deepika Bablani,
Michael V. Debole,
Steven K. Esser,
Jeffrey A. Kusnitz,
John V. Arthur,
Dharmendra S. Modha
Abstract:
Low precision networks in the reinforcement learning (RL) setting are relatively unexplored because of the limitations of binary activations for function approximation. Here, in the discrete action ATARI domain, we demonstrate, for the first time, that low precision policy distillation from a high precision network provides a principled, practical way to train an RL agent. As an application, on 10 different ATARI games, we demonstrate real-time end-to-end game playing on low-power neuromorphic hardware by converting a sequence of game frames into discrete actions.
Submitted 24 September, 2018;
originally announced September 2018.
-
Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference
Authors:
Jeffrey L. McKinstry,
Steven K. Esser,
Rathinakumar Appuswamy,
Deepika Bablani,
John V. Arthur,
Izzet B. Yildiz,
Dharmendra S. Modha
Abstract:
To realize the promise of ubiquitous embedded deep network inference, it is essential to seek limits of energy and area efficiency. To this end, low-precision networks offer tremendous promise because both energy and area scale down quadratically with the reduction in precision. Here we demonstrate ResNet-18, -34, -50, -152, Inception-v3, Densenet-161, and VGG-16bn networks on the ImageNet classification benchmark that, at 8-bit precision, exceed the accuracy of the full-precision baseline networks after one epoch of finetuning, thereby leveraging the availability of pretrained models. We also demonstrate ResNet-18, -34, -50, -152, Densenet-161, and VGG-16bn 4-bit models that match the accuracy of the full-precision baseline networks -- the highest scores to date. Surprisingly, the weights of the low-precision networks are very close (in cosine similarity) to the weights of the corresponding baseline networks, making training from scratch unnecessary.
We find that gradient noise due to quantization during training increases with reduced precision, and seek ways to overcome this noise. The number of iterations required by SGD to achieve a given training error is related to the square of (a) the distance of the initial solution from the final plus (b) the maximum variance of the gradient estimates. Therefore, we (a) reduce solution distance by starting with pretrained fp32 precision baseline networks and fine-tuning, and (b) combat gradient noise introduced by quantization by training longer and reducing learning rates. Sensitivity analysis indicates that these simple techniques, coupled with proper activation function range calibration to take full advantage of the limited precision, are sufficient to discover low-precision networks, if they exist, close to fp32 precision baseline networks. The results herein provide evidence that 4-bits suffice for classification.
Submitted 24 February, 2019; v1 submitted 11 September, 2018;
originally announced September 2018.
-
Structured Convolution Matrices for Energy-efficient Deep learning
Authors:
Rathinakumar Appuswamy,
Tapan Nayak,
John Arthur,
Steven Esser,
Paul Merolla,
Jeffrey Mckinstry,
Timothy Melano,
Myron Flickner,
Dharmendra Modha
Abstract:
We derive a relationship between network representation in energy-efficient neuromorphic architectures and block Toeplitz convolutional matrices. Inspired by this connection, we develop deep convolutional networks using a family of structured convolutional matrices and achieve state-of-the-art trade-off between energy efficiency and classification accuracy for well-known image recognition tasks. We also put forward a novel method to train binary convolutional networks by utilising an existing connection between noisy-rectified linear units and binary activations.
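For readers unfamiliar with the link between convolution and Toeplitz structure, the sketch below shows the one-dimensional case: a sliding-window filter is equivalent to multiplying the input by a matrix whose diagonals are constant. The paper concerns block Toeplitz matrices for multi-dimensional convolutions; this simplified 1D version and the helper names are assumptions chosen for brevity.

```python
def toeplitz_from_kernel(kernel, input_length):
    """Build the (input_length - len(kernel) + 1) x input_length Toeplitz matrix."""
    m = len(kernel)
    rows = input_length - m + 1
    matrix = [[0] * input_length for _ in range(rows)]
    for i in range(rows):
        for k, w in enumerate(kernel):
            matrix[i][i + k] = w  # same kernel, shifted one column per row
    return matrix


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sliding_correlation(kernel, signal):
    """Direct sliding-window filter (no kernel flip, 'valid' positions only)."""
    m = len(kernel)
    return [
        sum(kernel[k] * signal[i + k] for k in range(m))
        for i in range(len(signal) - m + 1)
    ]
```

For any kernel and signal, `matvec(toeplitz_from_kernel(kernel, len(signal)), signal)` gives the same values as `sliding_correlation(kernel, signal)`.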
Submitted 8 June, 2016;
originally announced June 2016.
-
Deep neural networks are robust to weight binarization and other non-linear distortions
Authors:
Paul Merolla,
Rathinakumar Appuswamy,
John Arthur,
Steve K. Esser,
Dharmendra Modha
Abstract:
Recent results show that deep neural networks achieve excellent performance even when, during training, weights are quantized and projected to a binary representation. Here, we show that this is just the tip of the iceberg: these same networks, during testing, also exhibit a remarkable robustness to distortions beyond quantization, including additive and multiplicative noise, and a class of non-linear projections where binarization is just a special case. To quantify this robustness, we show that one such network achieves 11% test error on CIFAR-10 even with 0.68 effective bits per weight. Furthermore, we find that a common training heuristic--namely, projecting quantized weights during backpropagation--can be altered (or even removed) and networks still achieve a base level of robustness during testing. Specifically, training with weight projections other than quantization also works, as does simply clipping the weights, both of which have never been reported before. We confirm our results for CIFAR-10 and ImageNet datasets. Finally, drawing from these ideas, we propose a stochastic projection rule that leads to a new state-of-the-art network with 7.64% test error on CIFAR-10 using no data augmentation.
Submitted 6 June, 2016;
originally announced June 2016.
-
Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing
Authors:
Steven K. Esser,
Paul A. Merolla,
John V. Arthur,
Andrew S. Cassidy,
Rathinakumar Appuswamy,
Alexander Andreopoulos,
David J. Berg,
Jeffrey L. McKinstry,
Timothy Melano,
Davis R. Barch,
Carmelo di Nolfo,
Pallab Datta,
Arnon Amir,
Brian Taba,
Myron D. Flickner,
Dharmendra S. Modha
Abstract:
Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that i) approach state-of-the-art classification accuracy across 8 standard datasets, encompassing vision and speech, ii) perform inference while preserving the hardware's underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1200 and 2600 frames per second and using between 25 and 275 mW (effectively > 6000 frames / sec / W) and iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. For the first time, the algorithmic power of deep learning can be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.
Submitted 24 May, 2016; v1 submitted 27 March, 2016;
originally announced March 2016.
-
Mapping Generative Models onto a Network of Digital Spiking Neurons
Authors:
Bruno U. Pedroni,
Srinjoy Das,
John V. Arthur,
Paul A. Merolla,
Bryan L. Jackson,
Dharmendra S. Modha,
Kenneth Kreutz-Delgado,
Gert Cauwenberghs
Abstract:
Stochastic neural networks such as Restricted Boltzmann Machines (RBMs) have been successfully used in applications ranging from speech recognition to image classification. Inference and learning in these algorithms use a Markov Chain Monte Carlo procedure called Gibbs sampling, where a logistic function forms the kernel of this sampler. On the other side of the spectrum, neuromorphic systems have shown great promise for low-power and parallelized cognitive computing, but lack well-suited applications and automation procedures. In this work, we propose a systematic method for bridging the RBM algorithm and digital neuromorphic systems, with a generative pattern completion task as proof of concept. For this, we first propose a method of producing the Gibbs sampler using bio-inspired digital noisy integrate-and-fire neurons. Next, we describe the process of mapping generative RBMs trained offline onto the IBM TrueNorth neurosynaptic processor -- a low-power digital neuromorphic VLSI substrate. Mapping these algorithms onto neuromorphic hardware presents unique challenges in network connectivity and weight and bias quantization, which, in turn, require architectural and design strategies for the physical realization. Generative performance metrics are analyzed to validate the neuromorphic requirements and to best select the neuron parameters for the model. Lastly, we describe a design automation procedure which achieves optimal resource usage, accounting for the novel hardware adaptations. This work represents the first implementation of generative RBM inference on a neuromorphic VLSI substrate.
Submitted 9 October, 2015; v1 submitted 24 September, 2015;
originally announced September 2015.
-
Gibbs Sampling with Low-Power Spiking Digital Neurons
Authors:
Srinjoy Das,
Bruno Umbria Pedroni,
Paul Merolla,
John Arthur,
Andrew S. Cassidy,
Bryan L. Jackson,
Dharmendra Modha,
Gert Cauwenberghs,
Ken Kreutz-Delgado
Abstract:
Restricted Boltzmann Machines and Deep Belief Networks have been successfully used in a wide variety of applications including image classification and speech recognition. Inference and learning in these algorithms use a Markov Chain Monte Carlo procedure called Gibbs sampling. A sigmoidal function forms the kernel of this sampler which can be realized from the firing statistics of noisy integrate-and-fire neurons on a neuromorphic VLSI substrate. This paper demonstrates such an implementation on an array of digital spiking neurons with stochastic leak and threshold properties for inference tasks and presents some key performance metrics for such a hardware-based sampler in both the generative and discriminative contexts.
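The role of the sigmoid as the kernel of the sampler can be seen in a conventional software Gibbs step for a binary Restricted Boltzmann Machine, sketched below. This is the textbook software form, not the spiking-neuron realization the paper describes; the function names and the weight layout (`weights[j][i]` connects visible unit i to hidden unit j) are assumptions for this example.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_layer(inputs, weights, biases, rng):
    """Sample each binary unit with probability sigmoid(bias + weighted input)."""
    samples = []
    for row, bias in zip(weights, biases):
        p_on = sigmoid(bias + sum(w * x for w, x in zip(row, inputs)))
        samples.append(1 if rng.random() < p_on else 0)
    return samples


def gibbs_step(visible, weights, hidden_bias, visible_bias, rng):
    """One alternating Gibbs step: sample hidden given visible, then visible given hidden."""
    hidden = sample_layer(visible, weights, hidden_bias, rng)
    transposed = [list(col) for col in zip(*weights)]  # hidden -> visible weights
    new_visible = sample_layer(hidden, transposed, visible_bias, rng)
    return new_visible, hidden
```

A caller would create `rng = random.Random(seed)` and apply `gibbs_step` repeatedly, feeding each returned visible vector back in as the next input.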
Submitted 27 March, 2015; v1 submitted 26 March, 2015;
originally announced March 2015.
-
Optimal Lempel-Ziv based lossy compression for memoryless data: how to make the right mistakes
Authors:
Narayana Santhanam,
Dharmendra Modha
Abstract:
Compression refers to encoding data using bits, so that the representation uses as few bits as possible. Compression can be lossless (i.e., the encoded data can be recovered exactly from its representation) or lossy, where the data is compressed more than in the lossless case but can still be recovered to within a prespecified distortion. In this paper, we prove the optimality of Codelet Parsing, a quasi-linear time algorithm for lossy compression of sequences of bits that are independently and identically distributed (i.i.d.) under Hamming distortion. Codelet Parsing extends the lossless Lempel-Ziv algorithm to the lossy case, a task that has been a focus of the source coding literature for the better part of two decades now. Given i.i.d. sequences $x$, the expected length of the shortest lossy representation such that $x$ can be reconstructed to within distortion $D$ is given by the rate distortion function, $R(D)$. We prove the optimality of the Codelet Parsing algorithm for lossy compression of memoryless bit sequences. It splits the input sequence naturally into phrases, representing each phrase by a codelet, a potentially distorted phrase of the same length. The codelets in the lossy representation of a length-$n$ string $x$ have length roughly $(\log n)/R(D)$, and like the lossless Lempel-Ziv algorithm, Codelet Parsing constructs codebooks logarithmic in the sequence length.
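To give a feel for the codelet length quoted in the abstract, the sketch below evaluates the standard rate distortion function of a Bernoulli(p) bit source under Hamming distortion, h(p) - h(D) for D below min(p, 1 - p) and 0 otherwise, and plugs it into the (log n) / rate expression. The base-2 logarithm, the symbol names, and the helper `approx_codelet_length` are assumptions for illustration; the abstract does not fix these details, and this is not an implementation of Codelet Parsing itself.

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rate_distortion_bernoulli(p, distortion):
    """Rate distortion function of a Bernoulli(p) source under Hamming distortion."""
    if distortion >= min(p, 1.0 - p):
        return 0.0  # this distortion is achievable without sending any bits
    return binary_entropy(p) - binary_entropy(distortion)


def approx_codelet_length(n, p, distortion):
    """Rough codelet length (log2 n) / rate for a length-n sequence (assumed base 2)."""
    rate = rate_distortion_bernoulli(p, distortion)
    if rate == 0.0:
        raise ValueError("rate is zero; the length expression is undefined")
    return math.log2(n) / rate
```

With a fair-coin source and zero allowed distortion the rate is 1 bit per symbol, so the expression reduces to log2 n, which matches the phrase lengths of lossless Lempel-Ziv parsing.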
Submitted 17 October, 2012; v1 submitted 17 October, 2012;
originally announced October 2012.