-
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
Authors:
Marco Federici,
Davide Belli,
Mart van Baalen,
Amir Jalalirad,
Andrii Skliar,
Bence Major,
Markus Nagel,
Paul Whatmough
Abstract:
While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU, which results in little inherent sparsity. While SwiGLU activations can be pruned based on magnitude, the resulting sparsity patterns are difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain some performance lost during sparsification. Lastly, we describe a novel cache-aware masking strategy, which considers the cache state and activation magnitude to further increase cache hit rate, improving LLM token rate on mobile devices. DIP outperforms other methods in terms of accuracy, memory and throughput trade-offs across simulated hardware settings. On Phi-3-Medium, DIP achieves a 46% reduction in memory and 40% increase in throughput with $<$ 0.1 loss in perplexity when compared to streaming the dense model from Flash. The open source code for HW simulator, methods, and experiments in this paper is available at https://github.com/Qualcomm-AI-research/dynamic-sparsity .
Submitted 3 April, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.
-
Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
Authors:
Andrii Skliar,
Ties van Rozendaal,
Romain Lepert,
Todor Boinovski,
Mart van Baalen,
Markus Nagel,
Paul Whatmough,
Babak Ehteshami Bejnordi
Abstract:
Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE on memory-constrained devices where only a subset of expert weights fit in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution to extend MoE's applicability across real-world applications.
Submitted 27 November, 2024;
originally announced December 2024.
-
Rapid Switching and Multi-Adapter Fusion via Sparse High Rank Adapters
Authors:
Kartikeya Bhardwaj,
Nilesh Prasad Pandey,
Sweta Priyadarshi,
Viswanath Ganapathy,
Rafael Esteves,
Shreya Kadambi,
Shubhankar Borse,
Paul Whatmough,
Risheek Garrepalli,
Mart Van Baalen,
Harris Teague,
Markus Nagel
Abstract:
In this paper, we propose Sparse High Rank Adapters (SHiRA) that directly finetune 1-2% of the base model weights while leaving others unchanged, thus, resulting in a highly sparse adapter. This high sparsity incurs no inference overhead, enables rapid switching directly in the fused mode, and significantly reduces concept-loss during multi-adapter fusion. Our extensive experiments on LVMs and LLMs demonstrate that finetuning merely 1-2% parameters in the base model is sufficient for many adapter tasks and significantly outperforms Low Rank Adaptation (LoRA). We also show that SHiRA is orthogonal to advanced LoRA methods such as DoRA and can be easily combined with existing techniques.
Submitted 22 July, 2024;
originally announced July 2024.
-
Sparse High Rank Adapters
Authors:
Kartikeya Bhardwaj,
Nilesh Prasad Pandey,
Sweta Priyadarshi,
Viswanath Ganapathy,
Shreya Kadambi,
Rafael Esteves,
Shubhankar Borse,
Paul Whatmough,
Risheek Garrepalli,
Mart Van Baalen,
Harris Teague,
Markus Nagel
Abstract:
Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept-loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged. This results in a highly sparse adapter which can be switched directly in the fused mode. We further provide theoretical and empirical insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning only a small fraction of the parameters in the base model significantly outperforms LoRA while enabling both rapid switching and multi-adapter fusion. Finally, we provide a latency- and memory-efficient SHiRA implementation based on Parameter-Efficient Finetuning (PEFT) Library which trains at nearly the same speed as LoRA while consuming up to 16% lower peak GPU memory, thus making SHiRA easy to adopt for practical use cases. To demonstrate rapid switching benefits during inference, we show that loading SHiRA on a base model can be 5x-16x faster than LoRA fusion on a CPU.
Submitted 26 January, 2025; v1 submitted 18 June, 2024;
originally announced June 2024.
-
GPTVQ: The Blessing of Dimensionality for LLM Quantization
Authors:
Mart van Baalen,
Andrey Kuzmin,
Ivan Koryakovskiy,
Markus Nagel,
Peter Couperus,
Cedric Bastoul,
Eric Mahurin,
Tijmen Blankevoort,
Paul Whatmough
Abstract:
In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state-of-the-art in the size vs accuracy trade-offs on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.
Submitted 3 June, 2025; v1 submitted 23 February, 2024;
originally announced February 2024.
-
The LLM Surgeon
Authors:
Tycho F. A. van der Ouderaa,
Markus Nagel,
Mart van Baalen,
Yuki M. Asano,
Tijmen Blankevoort
Abstract:
State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to large language models. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of large language models.
Submitted 20 March, 2024; v1 submitted 28 December, 2023;
originally announced December 2023.
-
QBitOpt: Fast and Accurate Bitwidth Reallocation during Training
Authors:
Jorn Peters,
Marios Fournarakis,
Markus Nagel,
Mart van Baalen,
Tijmen Blankevoort
Abstract:
Quantizing neural networks is one of the most effective methods for achieving efficient inference on mobile and embedded devices. In particular, mixed precision quantized (MPQ) networks, whose layers can be quantized to different bitwidths, achieve better task performance for the same resource constraint compared to networks with homogeneous bitwidths. However, finding the optimal bitwidth allocation is a challenging problem as the search space grows exponentially with the number of layers in the network. In this paper, we propose QBitOpt, a novel algorithm for updating bitwidths during quantization-aware training (QAT). We formulate the bitwidth allocation problem as a constraint optimization problem. By combining fast-to-compute sensitivities with efficient solvers during QAT, QBitOpt can produce mixed-precision networks with high task performance guaranteed to satisfy strict resource constraints. This contrasts with existing mixed-precision methods that learn bitwidths using gradients and cannot provide such guarantees. We evaluate QBitOpt on ImageNet and confirm that we outperform existing fixed and mixed-precision methods under average bitwidth constraints commonly found in the literature.
Submitted 10 July, 2023;
originally announced July 2023.
-
Pruning vs Quantization: Which is Better?
Authors:
Andrey Kuzmin,
Markus Nagel,
Mart van Baalen,
Arash Behboodi,
Tijmen Blankevoort
Abstract:
Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question on which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We provide an extensive comparison between the two techniques for compressing deep neural networks. First, we give an analytical comparison of expected quantization and pruning error for general data distributions. Then, we provide lower bounds for the per-layer pruning and quantization error in trained networks, and compare these to empirical error after optimization. Finally, we provide an extensive experimental comparison for training 8 large-scale models on 3 tasks. Our results show that in most cases quantization outperforms pruning. Only in some scenarios with very high compression ratio, pruning might be beneficial from an accuracy standpoint.
Submitted 16 February, 2024; v1 submitted 6 July, 2023;
originally announced July 2023.
-
FP8 versus INT8 for efficient deep learning inference
Authors:
Mart van Baalen,
Andrey Kuzmin,
Suparna S Nair,
Yuwei Ren,
Eric Mahurin,
Chirag Patel,
Sundar Subramanian,
Sanghyuk Lee,
Markus Nagel,
Joseph Soriaga,
Tijmen Blankevoort
Abstract:
Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed-precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the efficient inference device world, workloads are frequently executed in INT8. Sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance for both the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware-training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50-180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and a read of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8 and conclude with a brief discussion on the most efficient way for on-device deployment and an extensive suite of INT8 results for many models.
Submitted 15 June, 2023; v1 submitted 31 March, 2023;
originally announced March 2023.
-
A Practical Mixed Precision Algorithm for Post-Training Quantization
Authors:
Nilesh Prasad Pandey,
Markus Nagel,
Mart van Baalen,
Yin Huang,
Chirag Patel,
Tijmen Blankevoort
Abstract:
Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer gets quantized to the same number of bits. However, for many networks some layers are significantly more robust to quantization noise than others, leaving an important axis of improvement unused. As many hardware solutions provide multiple different bit-width settings, mixed-precision quantization has emerged as a promising solution to find a better performance-efficiency trade-off than homogeneous quantization. However, most existing mixed precision algorithms are rather difficult to use for practitioners as they require access to the training data, have many hyper-parameters to tune or even depend on end-to-end retraining of the entire model. In this work, we present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset to automatically select suitable bit-widths for each layer for desirable on-device performance. Our algorithm requires no hyper-parameter tuning, is robust to data variation and takes into account practical hardware deployment constraints making it a great candidate for practical use. We experimentally validate our proposed method on several computer vision tasks, natural language processing tasks and many different networks, and show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
Submitted 10 February, 2023;
originally announced February 2023.
-
FP8 Quantization: The Power of the Exponent
Authors:
Andrey Kuzmin,
Mart Van Baalen,
Yuwei Ren,
Markus Nagel,
Jorn Peters,
Tijmen Blankevoort
Abstract:
When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper investigates in depth this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. Then we show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and a new algorithm that enables the learning of both the scale parameters and the number of exponent bits in the FP8 format. Our chief conclusion is that when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training where the difference in formats disappears as the network is trained to reduce the effect of outliers.
Submitted 23 February, 2024; v1 submitted 19 August, 2022;
originally announced August 2022.
-
Quantized Sparse Weight Decomposition for Neural Network Compression
Authors:
Andrey Kuzmin,
Mart van Baalen,
Markus Nagel,
Arash Behboodi
Abstract:
In this paper, we introduce a novel method of neural network weight compression. In our method, we store weight tensors as sparse, quantized matrix factors, whose product is computed on the fly during inference to generate the target model's weights. We use projected gradient descent methods to find quantized and sparse factorization of the weight tensors. We show that this approach can be seen as a unification of weight SVD, vector quantization, and sparse PCA. Combined with end-to-end fine-tuning our method exceeds or is on par with previous state-of-the-art methods in terms of the trade-off between accuracy and model size. Our method is applicable to both moderate compression regimes, unlike vector quantization, and extreme compression regimes.
Submitted 22 July, 2022;
originally announced July 2022.
-
Cyclical Pruning for Sparse Neural Networks
Authors:
Suraj Srinivas,
Andrey Kuzmin,
Markus Nagel,
Mart van Baalen,
Andrii Skliar,
Tijmen Blankevoort
Abstract:
Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy. In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights. To enable weight recovery, we propose a simple strategy called \textit{cyclical pruning} which requires the pruning schedule to be periodic and allows for weights pruned erroneously in one cycle to recover in subsequent ones. Experimental results on both linear models and large-scale deep neural networks show that cyclical pruning outperforms existing pruning algorithms, especially at high sparsity ratios. Our approach is easy to tune and can be readily incorporated into existing pruning pipelines to boost performance.
Submitted 2 February, 2022;
originally announced February 2022.
-
On the origins of the Omicron variant of the SARS-CoV-2 virus
Authors:
Robert Penner,
Minus van Baalen
Abstract:
A possible explanation based on first principles for the appearance of the Omicron variant of the SARS-CoV-2 virus is proposed involving coinfection with HIV. The gist is that the resultant HIV-induced immunocompromise allows SARS-CoV-2 greater latitude to explore its own mutational space. This latitude is not without restriction, and a specific biophysical constraint is explored. Specifically, a nearly two- to five-fold discrepancy in backbone hydrogen bonding is observed between sub-molecules in Protein Data Bank files of the spike glycoprotein yielding two conclusions: mutagenic residues in the receptor-binding subunit of the spike much more frequently do not participate in backbone hydrogen bonds; and a technique of viral escape is therefore to remove such bonds within physico-chemical and functional constraints. Earlier work, from which the previous discussion is entirely independent, explains these phenomena from general principles of free energy, namely, the metastability of the glycoprotein. The conclusions therefore likely hold more generally as principles in virology.
Submitted 8 December, 2021;
originally announced January 2022.
-
A White Paper on Neural Network Quantization
Authors:
Markus Nagel,
Marios Fournarakis,
Rana Ali Amjad,
Yelysei Bondarenko,
Mart van Baalen,
Tijmen Blankevoort
Abstract:
While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance while maintaining low-bit weights and activations. We start with a hardware motivated introduction to quantization and then consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware-Training (QAT). PTQ requires no re-training or labelled data and is thus a lightweight push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with close to floating-point accuracy. QAT requires fine-tuning and access to labeled training data but enables lower bit quantization with competitive results. For both solutions, we provide tested pipelines based on existing literature and extensive experimentation that lead to state-of-the-art performance for common deep learning models and tasks.
Submitted 15 June, 2021;
originally announced June 2021.
-
Bayesian Bits: Unifying Quantization and Pruning
Authors:
Mart van Baalen,
Christos Louizos,
Markus Nagel,
Rana Ali Amjad,
Ying Wang,
Tijmen Blankevoort,
Max Welling
Abstract:
We introduce Bayesian Bits, a practical method for joint mixed precision quantization and pruning through gradient based optimization. Bayesian Bits employs a novel decomposition of the quantization operation, which sequentially considers doubling the bit width. At each new bit width, the residual error between the full precision value and the previously rounded value is quantized. We then decide whether or not to add this quantized residual error for a higher effective bit width and lower quantization noise. By starting with a power-of-two bit width, this decomposition will always produce hardware-friendly configurations, and through an additional 0-bit option, serves as a unified view of pruning and quantization. Bayesian Bits then introduces learnable stochastic gates, which collectively control the bit width of the given tensor. As a result, we can obtain low bit solutions by performing approximate inference over the gates, with prior distributions that encourage most of them to be switched off. We experimentally validate our proposed method on several benchmark datasets and show that we can learn pruned, mixed precision networks that provide a better trade-off between accuracy and efficiency than their static bit width equivalents.
Submitted 27 October, 2020; v1 submitted 14 May, 2020;
originally announced May 2020.
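The residual decomposition described in the Bayesian Bits abstract above can be illustrated with a minimal sketch. It assumes a scalar input, a fixed range `[lo, hi]`, and hard 0/1 gates passed in by the caller, whereas the abstract describes learnable stochastic gates. The names `residual_quantize`, `quantize_to_grid`, `gates`, and `keep` are hypothetical and not taken from the paper's code.

```python
def quantize_to_grid(value, step):
    """Round value to the nearest multiple of step."""
    return step * round(value / step)


def residual_quantize(x, gates, lo=0.0, hi=1.0, keep=True):
    """Quantize x in [lo, hi], doubling the bit width 2 -> 4 -> 8 -> ...

    gates[i] is truthy to add the quantized residual for the next bit
    width, falsy to stop. keep=False models the 0-bit (pruned) option.
    Returns (quantized value, effective bit width).
    """
    if not keep:
        return 0.0, 0                      # 0-bit option: value is pruned
    x = min(max(x, lo), hi)                # clip to the quantization range
    bits = 2
    step = (hi - lo) / (2 ** bits - 1)     # 2-bit grid over the range
    x_q = lo + quantize_to_grid(x - lo, step)
    for gate in gates:
        if not gate:
            break                          # a closed gate disables all higher widths
        # (2^b - 1)(2^b + 1) = 2^(2b) - 1, so the finer grid nests in the coarser one
        step = step / (2 ** bits + 1)
        bits *= 2
        x_q += quantize_to_grid(x - x_q, step)  # quantize the remaining residual
    return x_q, bits
```

For example, `residual_quantize(0.37, [1, 1])` opens both gates and returns an 8-bit value, while `residual_quantize(0.37, [0, 1])` stops at 2 bits because the first gate is closed. Starting at 2 bits and doubling keeps every reachable configuration at a power-of-two bit width.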
-
Up or Down? Adaptive Rounding for Post-Training Quantization
Authors:
Markus Nagel,
Rana Ali Amjad,
Mart van Baalen,
Christos Louizos,
Tijmen Blankevoort
Abstract:
When quantizing neural networks, assigning each floating-point weight to its nearest fixed-point value is the predominant approach. We find that, perhaps surprisingly, this is not the best we can do. In this paper, we propose AdaRound, a better weight-rounding mechanism for post-training quantization that adapts to the data and the task loss. AdaRound is fast, does not require fine-tuning of the network, and only uses a small amount of unlabelled data. We start by theoretically analyzing the rounding problem for a pre-trained neural network. By approximating the task loss with a Taylor series expansion, the rounding task is posed as a quadratic unconstrained binary optimization problem. We simplify this to a layer-wise local loss and propose to optimize this loss with a soft relaxation. AdaRound not only outperforms rounding-to-nearest by a significant margin but also establishes a new state-of-the-art for post-training quantization on several networks and tasks. Without fine-tuning, we can quantize the weights of Resnet18 and Resnet50 to 4 bits while staying within an accuracy loss of 1%.
Submitted 30 June, 2020; v1 submitted 22 April, 2020;
originally announced April 2020.
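The claim in the AdaRound abstract above, that rounding to nearest is not necessarily the best choice, can be checked on a toy layer. The sketch below is not AdaRound itself: instead of the soft relaxation mentioned in the abstract, it brute-forces every up/down choice for a small weight vector and scores each by a layer-wise output error. The function names, the random data, and the step size are hypothetical, and exhaustive search is only feasible here because the vector has 8 weights.

```python
import itertools
import numpy as np

def layer_output_error(w_q, w, x):
    """Squared error between layer outputs with original and rounded weights."""
    return float(np.sum((x @ (w - w_q)) ** 2))

def best_rounding_bruteforce(w, x, step):
    """Try every floor/ceil assignment and keep the one with the lowest error."""
    floor = np.floor(w / step)
    best_err, best_w = np.inf, None
    for ups in itertools.product((0, 1), repeat=w.size):
        w_q = step * (floor + np.array(ups))
        err = layer_output_error(w_q, w, x)
        if err < best_err:
            best_err, best_w = err, w_q
    return best_w, best_err

rng = np.random.default_rng(0)
w = rng.normal(size=8)         # toy weight vector
x = rng.normal(size=(32, 8))   # small batch of unlabelled inputs
step = 0.25                    # quantization grid spacing (assumed)

w_nearest = step * np.round(w / step)
err_nearest = layer_output_error(w_nearest, w, x)
w_best, err_best = best_rounding_bruteforce(w, x, step)
print("nearest:", err_nearest, "best up/down:", err_best)
```

Rounding to nearest is one of the candidate assignments, so the search can never do worse than it. Each binary choice per weight is what makes the problem a binary optimization, and the exponential number of assignments is the reason a relaxation is needed for real layers.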
-
Gradient $\ell_1$ Regularization for Quantization Robustness
Authors:
Milad Alizadeh,
Arash Behboodi,
Mart van Baalen,
Christos Louizos,
Tijmen Blankevoort,
Max Welling
Abstract:
We analyze the effect of quantizing weights and activations of neural networks on their loss and derive a simple regularization scheme that improves robustness against post-training quantization. By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths as energy and memory requirements of the application change. Unlike quantization-aware training using the straight-through estimator, which only targets a specific bit-width and requires access to the training data and pipeline, our regularization-based method paves the way for "on-the-fly" post-training quantization to various bit-widths. We show that by modeling quantization as an $\ell_\infty$-bounded perturbation, the first-order term in the loss expansion can be regularized using the $\ell_1$-norm of gradients. We experimentally validate the effectiveness of our regularization scheme on different architectures on the CIFAR-10 and ImageNet datasets.
Submitted 18 February, 2020;
originally announced February 2020.
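The link between an $\ell_\infty$-bounded perturbation and the $\ell_1$-norm of the gradient, stated in the abstract above, follows from Hölder's inequality: the first-order loss change $g^\top \delta$ is at most $\|\delta\|_\infty \|g\|_1$ in magnitude. The sketch below illustrates this on a linear least-squares model with an analytic gradient. It is an assumption-laden toy, not the paper's training setup: the model, the function names, and the grid step are hypothetical, and only weight quantization is perturbed.

```python
import numpy as np

def mse_loss_and_grad(w, x, y):
    """Mean squared error of a linear model and its gradient w.r.t. w."""
    residual = x @ w - y
    loss = 0.5 * np.mean(residual ** 2)
    grad = x.T @ residual / len(y)
    return loss, grad

def regularized_loss(w, x, y, lam):
    """Task loss plus lam times the l1-norm of the gradient."""
    loss, grad = mse_loss_and_grad(w, x, y)
    return loss + lam * np.sum(np.abs(grad))

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
w = rng.normal(size=5)
y = x @ rng.normal(size=5)

# Rounding to a uniform grid moves each weight by at most step / 2,
# so the perturbation is bounded in the l-infinity norm.
step = 0.1
delta = step * np.round(w / step) - w

loss, grad = mse_loss_and_grad(w, x, y)
first_order_change = float(grad @ delta)
bound = float(np.max(np.abs(delta)) * np.sum(np.abs(grad)))
print(abs(first_order_change), "<=", bound)
```

Since the bound scales with the grid step, a small gradient $\ell_1$-norm limits the first-order loss change for every bit-width at once, which matches the abstract's point about quantizing a single set of weights to different bit-widths.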
-
Data-Free Quantization Through Weight Equalization and Bias Correction
Authors:
Markus Nagel,
Mart van Baalen,
Tijmen Blankevoort,
Max Welling
Abstract:
We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization is essential for efficient inference on modern deep learning hardware. However, quantizing models to run in 8-bit is a non-trivial task, frequently leading to either significant performance reduction or engineering time spent on training a network to be amenable to quantization. Our approach relies on equalizing the weight ranges in the network by making use of a scale-equivariance property of activation functions. In addition, the method corrects biases in the error that are introduced during quantization. This improves the accuracy of the quantized model, and the method can be applied to many common computer vision architectures with a straightforward API call. For common architectures, such as the MobileNet family, we achieve state-of-the-art quantized model performance. We further show that the method also extends to other computer vision architectures and tasks such as semantic segmentation and object detection.
Submitted 25 November, 2019; v1 submitted 11 June, 2019;
originally announced June 2019.
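The weight-range equalization idea in the abstract above rests on the fact that ReLU is scale-equivariant for positive scales: `relu(s * z) == s * relu(z)` when `s > 0`. A minimal sketch for two fully connected layers follows. The function names and the particular per-channel scale (chosen so that both layers end up with the same range per channel) are assumptions made for illustration; the abstract does not state the formula, and the bias-correction step it mentions is not shown.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(w1, b1, w2, x):
    """Two-layer network: y = w2 @ relu(w1 @ x + b1)."""
    return w2 @ relu(w1 @ x + b1)

def equalize_pair(w1, b1, w2):
    """Rescale channel i of layer 1 by 1/s_i and of layer 2 by s_i.

    r1[i] is the range of output channel i of layer 1, r2[i] the range of
    the matching input channel of layer 2. The scale below is one assumed
    choice that makes both ranges equal to sqrt(r1[i] * r2[i]).
    """
    r1 = np.abs(w1).max(axis=1)
    r2 = np.abs(w2).max(axis=0)
    s = np.sqrt(r1 * r2) / r2          # positive, so ReLU commutes with it
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 3)) * np.array([0.01, 0.1, 1.0, 10.0])[:, None]
b1 = rng.normal(size=4)
w2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

w1_eq, b1_eq, w2_eq = equalize_pair(w1, b1, w2)
print("ranges before:", np.abs(w1).max(axis=1))
print("ranges after: ", np.abs(w1_eq).max(axis=1))
print("same output:  ", np.allclose(forward(w1, b1, w2, x),
                                    forward(w1_eq, b1_eq, w2_eq, x)))
```

The rescaling needs no data and leaves the network function unchanged, while removing the large spread in per-channel ranges that makes a single quantization grid per tensor wasteful.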
-
Coevolutionary patterns caused by prey selection
Authors:
Sabrina B. L. Araujo,
Marcelo Eduardo Borges,
Francisco W. von Hartenthal,
Leonardo R. Jorge,
Thomas M. Lewinsohn,
Paulo R. Guimaraes Jr.,
Minus van Baalen
Abstract:
Many theoretical models have been formulated to better understand the coevolutionary patterns that emerge from antagonistic interactions. These models usually assume that the attacks by the exploiters are random, so the effect of victim selection by exploiters on coevolutionary patterns remains unexplored. Here we analytically study the payoff for predators and prey under coevolution, assuming that every individual predator can attack only a small number of prey at any given time and considering two scenarios: (i) predation occurs at random; (ii) predators select prey according to phenotype matching. We also develop an individual-based model to verify the robustness of our analytical prediction. We show that both scenarios result in similar, well-known coevolutionary patterns if population sizes are sufficiently large: symmetrical coevolutionary branching and symmetrical coevolutionary cycling (Red Queen dynamics). However, for small population sizes, prey selection can cause unexpected coevolutionary patterns. One is the breaking of symmetry of the coevolutionary pattern, where the phenotypes evolve towards one of two evolutionarily stable patterns. As population size increases, the phenotypes oscillate between these two values in a novel form of Red Queen dynamics, the episodic reversal between the two stable patterns. Thus, prey selection causes prey phenotypes to evolve towards more extreme values, which reduces the fitness of both predators and prey, increasing the likelihood of extinction.
Submitted 19 May, 2020; v1 submitted 24 September, 2018;
originally announced September 2018.