-
Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models
Authors:
Seungcheol Park,
Jeongin Bae,
Beomseok Kwon,
Minjun Kim,
Byeongwook Kim,
Se Jung Kwon,
U Kang,
Dongsoo Lee
Abstract:
How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF…
▽ More
How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF (Unified Quantization with Flexible Mapping), an accurate quantization method for LLMs. UniQuanF harnesses both strong expressiveness and optimizability by unifying the flexible mapping technique in UQ and non-uniform quantization levels of BCQ. We propose unified initialization, and local and periodic mapping techniques to optimize the parameters in UniQuanF precisely. After optimization, our unification theorem removes computational and memory overhead, allowing us to utilize the superior accuracy of UniQuanF without extra deployment costs induced by the unification. Experimental results demonstrate that UniQuanF outperforms existing UQ and BCQ methods, achieving up to 4.60% higher accuracy on GSM8K benchmark.
△ Less
Submitted 16 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
-
Faster Inference of LLMs using FP8 on the Intel Gaudi
Authors:
Joonhyung Lee,
Shmulik Markovich-Golan,
Daniel Ohayon,
Yair Hanani,
Gunho Park,
Byeongwook Kim,
Asaf Karnieli,
Uri Livne,
Haihao Shen,
Tai Huang,
Se Jung Kwon,
Dongsoo Lee
Abstract:
Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evalu…
▽ More
Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evaluations, is still lacking. In this work, we contribute in three significant ways. First, we analyze the implementation details and quantization options associated with FP8 for inference on the Intel Gaudi AI accelerator. Second, we empirically quantify the throughput improvements afforded by the use of FP8 at both the operator level and in end-to-end scenarios. Third, we assess the accuracy impact of various FP8 quantization methods. Our experimental results indicate that the Intel Gaudi 2 accelerator consistently achieves high computational unit utilization, frequently exceeding 90% MFU, while incurring an accuracy degradation of less than 1%.
△ Less
Submitted 16 March, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
An Inquiry into Datacenter TCO for LLM Inference with FP8
Authors:
Jiwoo Kim,
Joonhyung Lee,
Gunho Park,
Byeongwook Kim,
Se Jung Kwon,
Dongsoo Lee,
Youngjoo Lee
Abstract:
As large language models (LLMs) continue to scale, their inference demands present significant challenges, particularly due to the high power consumption of AI accelerators in datacenters. These facilities require specialized cooling and power management systems, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs). In this work, we analyze the computationa…
▽ More
As large language models (LLMs) continue to scale, their inference demands present significant challenges, particularly due to the high power consumption of AI accelerators in datacenters. These facilities require specialized cooling and power management systems, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs). In this work, we analyze the computational characteristics and constraints of LLM inference from a TCO perspective, focusing on two representative accelerators: the Gaudi 2 and NVIDIA H100. We present a generalizable framework that enables CSPs to compare and select AI accelerators according to diverse operational requirements. Using this model, we analyze the impact of FP8 precision and LLM inference workload characteristics as key factors influencing TCO. We investigate FP8 quantization, which is gaining adoption in LLM training, as a technique to improve inference throughput while maintaining cost efficiency. Furthermore, our analysis of LLM inference workloads reveals that performance on thin GEMMs, which dominate the decode phase, can have a greater impact than theoretical hardware peak performance. By studying the interaction between power consumption, quantization strategies, and hardware architecture, we offer insights that support informed deployment decisions and guide future accelerator designs to improve the TCO of LLM inference.
△ Less
Submitted 29 April, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Debunking the CUDA Myth Towards GPU-based AI Systems
Authors:
Yunjae Lee,
Juntaek Lim,
Jehyeon Bang,
Eunyeong Cho,
Huijong Jeong,
Taesu Kim,
Hyungjun Kim,
Joonhyung Lee,
Jinseop Im,
Ranggi Hwang,
Se Jung Kwon,
Dongsoo Lee,
Minsoo Rhu
Abstract:
This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs, which is currently the de facto standard in AI system design. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing sev…
▽ More
This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs, which is currently the de facto standard in AI system design. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing several important AI workloads end-to-end. We then assess Gaudi NPU's programmability by discussing several software-level optimization strategies to employ for implementing critical FBGEMM operators and vLLM, evaluating their efficiency against GPU-optimized counterparts. Results indicate that Gaudi-2 achieves energy efficiency comparable to A100, though there are notable areas for improvement in terms of software maturity. Overall, we conclude that, with effective integration into high-level AI frameworks, Gaudi NPUs could challenge NVIDIA GPU's dominance in the AI server market, though further improvements are necessary to fully compete with NVIDIA's robust software ecosystem.
△ Less
Submitted 21 March, 2025; v1 submitted 30 December, 2024;
originally announced January 2025.
-
LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
Authors:
Jung Hyun Lee,
Jeonghoon Kim,
June Yong Yang,
Se Jung Kwon,
Eunho Yang,
Kang Min Yoo,
Dongsoo Lee
Abstract:
With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language underst…
▽ More
With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ) - a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights, thus boosting the generalization capability of quantized LLMs. We show the superiority of LRQ over prior LLM PTQ works under (i) 8-bit weight and per-tensor activation quantization, (ii) 4-bit weight and 8-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes. Our code is available at Software.
△ Less
Submitted 8 February, 2025; v1 submitted 16 July, 2024;
originally announced July 2024.
-
To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability
Authors:
Joonhyung Lee,
Jeongin Bae,
Byeongwook Kim,
Se Jung Kwon,
Dongsoo Lee
Abstract:
The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent generations of accelerators. This trend has gone even further in th…
▽ More
The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent generations of accelerators. This trend has gone even further in the latest processors, where FP8 has recently been introduced. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8, with even fewer bits than FP16, can be a cost-effective option for LLM training. We argue that reduced-precision training schemes must have similar training stability and hyperparameter sensitivities to their higher-precision counterparts in order to be cost-effective. However, we find that currently available methods for FP8 training are not robust enough to allow their use as economical replacements. This prompts us to investigate the stability of reduced-precision LLM training in terms of robustness across random seeds, learning rates, and datasets. To this end, we propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models. By simulating incremental bit reductions in floating-point representations, we analyze the relationship between representational power and training stability with the intent of aiding future research into the field.
△ Less
Submitted 25 March, 2025; v1 submitted 28 May, 2024;
originally announced May 2024.
-
HyperCLOVA X Technical Report
Authors:
Kang Min Yoo,
Jaegeun Han,
Sookyo In,
Heewon Jeon,
Jisu Jeong,
Jaewook Kang,
Hyunwook Kim,
Kyung-Min Kim,
Munhyong Kim,
Sungju Kim,
Donghyun Kwak,
Hanock Kwak,
Se Jung Kwon,
Bado Lee,
Dongsoo Lee,
Gichang Lee,
Jooho Lee,
Baeseong Park,
Seongjin Shin,
Joonsang Yu,
Seolki Baek,
Sumin Byeon,
Eungsup Cho,
Dooseok Choe,
Jeesung Han
, et al. (371 additional authors not shown)
Abstract:
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…
▽ More
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
△ Less
Submitted 13 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
Authors:
June Yong Yang,
Byeongwook Kim,
Jeongin Bae,
Beomseok Kwon,
Gunho Park,
Eunho Yang,
Se Jung Kwon,
Dongsoo Lee
Abstract:
Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models~(LLMs). However, the memory footprint of the KV cache poses a critical bottleneck in LLM deployment as the cache size grows with batch size and sequence length, often surpassing even the size of the model itself. Although recent methods were proposed to s…
▽ More
Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models~(LLMs). However, the memory footprint of the KV cache poses a critical bottleneck in LLM deployment as the cache size grows with batch size and sequence length, often surpassing even the size of the model itself. Although recent methods were proposed to select and evict unimportant KV pairs from the cache to reduce memory consumption, the potential ramifications of eviction on the generative process are yet to be thoroughly examined. In this paper, we examine the detrimental impact of cache eviction and observe that unforeseen risks arise as the information contained in the KV pairs is exhaustively discarded, resulting in safety breaches, hallucinations, and context loss. Surprisingly, we find that preserving even a small amount of information contained in the evicted KV pairs via reduced precision quantization substantially recovers the incurred degradation. On the other hand, we observe that the important KV pairs must be kept at a relatively higher precision to safeguard the generation quality. Motivated by these observations, we propose \textit{Mixed-precision KV cache}~(MiKV), a reliable cache compression method that simultaneously preserves the context details by retaining the evicted KV pairs in low-precision and ensure generation quality by keeping the important KV pairs in high-precision. Experiments on diverse benchmarks and LLM backbones show that our proposed method offers a state-of-the-art trade-off between compression ratio and performance, compared to other baselines.
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation
Authors:
Sunghyeon Woo,
Baeseong Park,
Byeongwook Kim,
Minjung Jo,
Se Jung Kwon,
Dongsuk Jeon,
Dongsoo Lee
Abstract:
Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically involves substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and ac…
▽ More
Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically involves substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and activation memory. In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs and activation memory while maintaining accuracy. DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules generated by undropped layers and residual connections. Additionally, DropBP calculates the sensitivity of each layer to assign an appropriate drop rate, thereby stabilizing the training process. DropBP is not only applicable to full fine-tuning but can also be orthogonally integrated with all types of PEFT by dropping layers during backward propagation. Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5x, and enable training with a sequence length 6.2x larger on a single NVIDIA-A100 GPU. Furthermore, our DropBP enabled a throughput increase of 79% on a NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU. The code is available at https://github.com/WooSunghyeon/dropbp.
△ Less
Submitted 28 February, 2025; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Label-Noise Robust Diffusion Models
Authors:
Byeonghu Na,
Yeongmin Kim,
HeeSun Bae,
Jung Hyun Lee,
Se Jung Kwon,
Wanmo Kang,
Il-Chul Moon
Abstract:
Conditional diffusion models have shown remarkable performance in various generative tasks, but training them requires large-scale datasets that often contain noise in conditional inputs, a.k.a. noisy labels. This noise leads to condition mismatch and quality degradation of generated data. This paper proposes Transition-aware weighted Denoising Score Matching (TDSM) for training conditional diffus…
▽ More
Conditional diffusion models have shown remarkable performance in various generative tasks, but training them requires large-scale datasets that often contain noise in conditional inputs, a.k.a. noisy labels. This noise leads to condition mismatch and quality degradation of generated data. This paper proposes Transition-aware weighted Denoising Score Matching (TDSM) for training conditional diffusion models with noisy labels, which is the first study in the line of diffusion models. The TDSM objective contains a weighted sum of score networks, incorporating instance-wise and time-dependent label transition probabilities. We introduce a transition-aware weight estimator, which leverages a time-dependent noisy-label classifier distinctively customized to the diffusion process. Through experiments across various datasets and noisy label settings, TDSM improves the quality of generated samples aligned with given conditions. Furthermore, our method improves generation performance even on prevalent benchmark datasets, which implies the potential noisy labels and their risk of generative model learning. Finally, we show the improved performance of TDSM on top of conventional noisy label corrections, which empirically proving its contribution as a part of label-noise robust generative models. Our code is available at: https://github.com/byeonghu-na/tdsm.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models
Authors:
Jung Hwan Heo,
Jeonghoon Kim,
Beomseok Kwon,
Byeongwook Kim,
Se Jung Kwon,
Dongsoo Lee
Abstract:
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to the large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outlier…
▽ More
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to the large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than the conventional per-output-channel (per-OC). Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so similarly grouping the weights in the IC direction can isolate outliers within a group. We also find that activation outliers do not dictate quantization difficulty, and inherent weight sensitivities also exist. With per-IC quantization as a new outlier-friendly scheme, we propose Adaptive Dimensions (AdaDim), a versatile quantization framework that can adapt to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs. Code is available at https://github.com/johnheo/adadim-llm
△ Less
Submitted 13 April, 2025; v1 submitted 27 September, 2023;
originally announced September 2023.
-
FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
Authors:
Jung Hyun Lee,
Jeonghoon Kim,
Se Jung Kwon,
Dongsoo Lee
Abstract:
Post-training quantization (PTQ) has been gaining popularity for the deployment of deep neural networks on resource-limited devices since unlike quantization-aware training, neither a full training dataset nor end-to-end training is required at all. As PTQ schemes based on reconstructing each layer or block output turn out to be effective to enhance quantized model performance, recent works have d…
▽ More
Post-training quantization (PTQ) has been gaining popularity for the deployment of deep neural networks on resource-limited devices since unlike quantization-aware training, neither a full training dataset nor end-to-end training is required at all. As PTQ schemes based on reconstructing each layer or block output turn out to be effective to enhance quantized model performance, recent works have developed algorithms to devise and learn a new weight-rounding scheme so as to better reconstruct each layer or block output. In this work, we propose a simple yet effective new weight-rounding mechanism for PTQ, coined \emph{FlexRound}, based on element-wise division instead of typical element-wise addition such that FlexRound enables jointly learning a common quantization grid size as well as a different scale for each pre-trained weight. Thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound is inherently able to exploit pre-trained weights when updating their corresponding scales, and thus, flexibly quantize pre-trained weights depending on their magnitudes. We empirically validate the efficacy of FlexRound on a wide range of models and tasks. To the best of our knowledge, our work is the first to carry out comprehensive experiments on not only image classification and natural language understanding but also natural language generation. Moreover, we demonstrate, for the first time, that large language models can be efficiently quantized, with only a negligible impact on performance compared to half-precision baselines, achieved by reconstructing the output in a block-by-block manner. Our code is available at \url{https://github.com/onliwad101/FlexRound_LRQ}.
△ Less
Submitted 16 July, 2024; v1 submitted 31 May, 2023;
originally announced June 2023.
-
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
Authors:
Jeonghoon Kim,
Jung Hyun Lee,
Sungdong Kim,
Joonsuk Park,
Kang Min Yoo,
Se Jung Kwon,
Dongsoo Lee
Abstract:
Large language models (LLMs) face the challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights continues to be a pressing concern. Even though quantization techniques are widely proposed…
▽ More
Large language models (LLMs) face the challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights continues to be a pressing concern. Even though quantization techniques are widely proposed to ease memory demands and accelerate LLM inference, most of these techniques are geared towards the deployment phase. To bridge this gap, this paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA) - a simple yet effective method that combines the advantages of PEFT with quantized LLMs. By updating solely the quantization scales, PEQA can be directly applied to quantized LLMs, ensuring seamless task transitions. Parallel to existing PEFT methods, PEQA significantly reduces the memory overhead associated with the optimizer state. Furthermore, it leverages the advantages of quantization to substantially reduce model sizes. Even after fine-tuning, the quantization structure of a PEQA-tuned LLM remains intact, allowing for accelerated inference on the deployment stage. We employ PEQA-tuning for task-specific adaptation on LLMs with up to 65 billion parameters. To assess the logical reasoning and language comprehension of PEQA-tuned LLMs, we fine-tune low-bit quantized LLMs using a instruction dataset. Our results show that even when LLMs are quantized to below 4-bit precision, their capabilities in language modeling, few-shot in-context learning, and comprehension can be resiliently restored to (or even improved over) their full-precision original performances with PEQA.
△ Less
Submitted 28 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models
Authors:
Dongjun Kim,
Yeongmin Kim,
Se Jung Kwon,
Wanmo Kang,
Il-Chul Moon
Abstract:
The proposed method, Discriminator Guidance, aims to improve sample generation of pre-trained diffusion models. The approach introduces a discriminator that gives explicit supervision to a denoising sample path whether it is realistic or not. Unlike GANs, our approach does not require joint training of score and discriminator networks. Instead, we train the discriminator after score training, maki…
▽ More
The proposed method, Discriminator Guidance, aims to improve sample generation of pre-trained diffusion models. The approach introduces a discriminator that gives explicit supervision to a denoising sample path whether it is realistic or not. Unlike GANs, our approach does not require joint training of score and discriminator networks. Instead, we train the discriminator after score training, making discriminator training stable and fast to converge. In sample generation, we add an auxiliary term to the pre-trained score to deceive the discriminator. This term corrects the model score to the data score at the optimal discriminator, which implies that the discriminator helps better score estimation in a complementary way. Using our algorithm, we achive state-of-the-art results on ImageNet 256x256 with FID 1.83 and recall 0.64, similar to the validation data's FID (1.68) and recall (0.66). We release the code at https://github.com/alsdudrla10/DG.
△ Less
Submitted 4 June, 2023; v1 submitted 28 November, 2022;
originally announced November 2022.
-
AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models
Authors:
Se Jung Kwon,
Jeonghoon Kim,
Jeongin Bae,
Kang Min Yoo,
Jin-Hwa Kim,
Baeseong Park,
Byeongwook Kim,
Jung-Woo Ha,
Nako Sung,
Dongsoo Lee
Abstract:
There are growing interests in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression has not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving co…
▽ More
There are growing interests in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression has not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning consisting of post-training quantization of the pre-trained language model and fine-tuning only some parts of quantized parameters for a target task. Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving >10x compression ratio under 4-bit quantization and >1,000x reduction in the number of trainable parameters.
△ Less
Submitted 7 October, 2022;
originally announced October 2022.
-
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
Authors:
Gunho Park,
Baeseong Park,
Minsub Kim,
Sungjae Lee,
Jeonghoon Kim,
Beomseok Kwon,
Se Jung Kwon,
Byeongwook Kim,
Youngjoo Lee,
Dongsoo Lee
Abstract:
Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity. However, the growing size of NLP models introduces a memory wall problem during the generation phase. To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving f…
▽ More
Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity. However, the growing size of NLP models introduces a memory wall problem during the generation phase. To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving full precision for activations, resulting in practical speed-ups during inference on a single GPU. However, these improvements primarily stem from reduced memory movement, which necessitates a resource-intensive dequantization process rather than actual computational reduction. In this paper, we introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication, which not only eliminates the resource-intensive dequantization process but also reduces computational costs compared to previous kernels for weight-only quantization. Furthermore, we proposed group-wise quantization to offer a flexible trade-off between compression ratio and accuracy. The impact of LUT-GEMM is facilitated by implementing high compression ratios through low-bit quantization and efficient LUT-based operations. We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially accelerates token generation latency, achieving a remarkable 2.1$\times$ improvement on a single GPU when compared to OPTQ, which relies on the costly dequantization process.
△ Less
Submitted 1 April, 2024; v1 submitted 19 June, 2022;
originally announced June 2022.
-
Maximum Likelihood Training of Implicit Nonlinear Diffusion Models
Authors:
Dongjun Kim,
Byeonghu Na,
Se Jung Kwon,
Dongsoo Lee,
Wanmo Kang,
Il-Chul Moon
Abstract:
Whereas diverse variations of diffusion models exist, extending the linear diffusion into a nonlinear diffusion process is investigated by very few works. The nonlinearity effect has been hardly understood, but intuitively, there would be promising diffusion patterns to efficiently train the generative distribution towards the data distribution. This paper introduces a data-adaptive nonlinear diff…
▽ More
Whereas diverse variations of diffusion models exist, extending the linear diffusion into a nonlinear diffusion process is investigated by very few works. The nonlinearity effect has been hardly understood, but intuitively, there would be promising diffusion patterns to efficiently train the generative distribution towards the data distribution. This paper introduces a data-adaptive nonlinear diffusion process for score-based diffusion models. The proposed Implicit Nonlinear Diffusion Model (INDM) learns by combining a normalizing flow and a diffusion process. Specifically, INDM implicitly constructs a nonlinear diffusion on the \textit{data space} by leveraging a linear diffusion on the \textit{latent space} through a flow network. This flow network is key to forming a nonlinear diffusion, as the nonlinearity depends on the flow network. This flexible nonlinearity improves the learning curve of INDM to nearly Maximum Likelihood Estimation (MLE) against the non-MLE curve of DDPM++, which turns out to be an inflexible version of INDM with the flow fixed as an identity mapping. Also, the discretization of INDM shows the sampling robustness. In experiments, INDM achieves the state-of-the-art FID of 1.75 on CelebA. We release our code at https://github.com/byeonghu-na/INDM.
△ Less
Submitted 12 October, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Modulating Regularization Frequency for Efficient Compression-Aware Model Training
Authors:
Dongsoo Lee,
Se Jung Kwon,
Byeongwook Kim,
Jeongin Yun,
Baeseong Park,
Yongkweon Jeon
Abstract:
While model compression is increasingly important because of large neural network size, compression-aware training is challenging as it needs sophisticated model modifications and longer training time.In this paper, we introduce regularization frequency (i.e., how often compression is performed during training) as a new regularization technique for a practical and efficient compression-aware train…
▽ More
While model compression is increasingly important because of large neural network size, compression-aware training is challenging as it needs sophisticated model modifications and longer training time.In this paper, we introduce regularization frequency (i.e., how often compression is performed during training) as a new regularization technique for a practical and efficient compression-aware training method. For various regularization techniques, such as weight decay and dropout, optimizing the regularization strength is crucial to improve generalization in Deep Neural Networks (DNNs). While model compression also demands the right amount of regularization, the regularization strength incurred by model compression has been controlled only by compression ratio. Throughout various experiments, we show that regularization frequency critically affects the regularization strength of model compression. Combining regularization frequency and compression ratio, the amount of weight updates by model compression per mini-batch can be optimized to achieve the best model accuracy. Modulating regularization frequency is implemented by occasional model compression while conventional compression-aware training is usually performed for every mini-batch.
△ Less
Submitted 5 May, 2021;
originally announced May 2021.
-
Encoding Weights of Irregular Sparsity for Fixed-to-Fixed Model Compression
Authors:
Baeseong Park,
Se Jung Kwon,
Daehwan Oh,
Byeongwook Kim,
Dongsoo Lee
Abstract:
Even though fine-grained pruning techniques achieve a high compression ratio, conventional sparsity representations (such as CSR) associated with irregular sparsity degrade parallelism significantly. Practical pruning methods, thus, usually lower pruning rates (by structured pruning) to improve parallelism. In this paper, we study fixed-to-fixed (lossless) encoding architecture/algorithm to suppor…
▽ More
Even though fine-grained pruning techniques achieve a high compression ratio, conventional sparsity representations (such as CSR) associated with irregular sparsity degrade parallelism significantly. Practical pruning methods, thus, usually lower pruning rates (by structured pruning) to improve parallelism. In this paper, we study fixed-to-fixed (lossless) encoding architecture/algorithm to support fine-grained pruning methods such that sparse neural networks can be stored in a highly regular structure. We first estimate the maximum compression ratio of encoding-based compression using entropy. Then, as an effort to push the compression ratio to the theoretical maximum (by entropy), we propose a sequential fixed-to-fixed encoding scheme. We demonstrate that our proposed compression scheme achieves almost the maximum compression ratio for the Transformer and ResNet-50 pruned by various fine-grained pruning methods.
△ Less
Submitted 30 January, 2022; v1 submitted 5 May, 2021;
originally announced May 2021.
-
Q-Rater: Non-Convex Optimization for Post-Training Uniform Quantization
Authors:
Byeongwook Kim,
Dongsoo Lee,
Yeonju Ro,
Yongkweon Jeon,
Se Jung Kwon,
Baeseong Park,
Daehwan Oh
Abstract:
Various post-training uniform quantization methods have usually been studied based on convex optimization. As a result, most previous ones rely on the quantization error minimization and/or quadratic approximations. Such approaches are computationally efficient and reasonable when a large number of quantization bits are employed. When the number of quantization bits is relatively low, however, non…
▽ More
Various post-training uniform quantization methods have usually been studied based on convex optimization. As a result, most previous ones rely on the quantization error minimization and/or quadratic approximations. Such approaches are computationally efficient and reasonable when a large number of quantization bits are employed. When the number of quantization bits is relatively low, however, non-convex optimization is unavoidable to improve model accuracy. In this paper, we propose a new post-training uniform quantization technique considering non-convexity. We empirically show that hyper-parameters for clipping and rounding of weights and activations can be explored by monitoring task loss. Then, an optimally searched set of hyper-parameters is frozen to proceed to the next layer such that an incremental non-convex optimization is enabled for post-training quantization. Throughout extensive experimental results using various models, our proposed technique presents higher model accuracy, especially for a low-bit quantization.
△ Less
Submitted 5 May, 2021;
originally announced May 2021.
-
Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation
Authors:
Insoo Chung,
Byeongwook Kim,
Yoonjung Choi,
Se Jung Kwon,
Yongkweon Jeon,
Baeseong Park,
Sangha Kim,
Dongsoo Lee
Abstract:
The deployment of widely used Transformer architecture is challenging because of heavy computation load and memory overhead during inference, especially when the target device is limited in computational resources such as mobile or edge devices. Quantization is an effective technique to address such challenges. Our analysis shows that for a given number of quantization bits, each block of Transfor…
▽ More
The deployment of widely used Transformer architecture is challenging because of heavy computation load and memory overhead during inference, especially when the target device is limited in computational resources such as mobile or edge devices. Quantization is an effective technique to address such challenges. Our analysis shows that for a given number of quantization bits, each block of Transformer contributes to translation quality and inference computations in different manners. Moreover, even inside an embedding block, each word presents vastly different contributions. Correspondingly, we propose a mixed precision quantization strategy to represent Transformer weights by an extremely low number of bits (e.g., under 3 bits). For example, for each word in an embedding block, we assign different quantization bits based on statistical property. Our quantized Transformer model achieves 11.8$\times$ smaller model size than the baseline model, with less than -0.5 BLEU. We achieve 8.3$\times$ reduction in run-time memory footprints and 3.5$\times$ speed up (Galaxy N10+) such that our proposed compression strategy enables efficient implementation for on-device NMT.
△ Less
Submitted 13 October, 2020; v1 submitted 15 September, 2020;
originally announced September 2020.
-
FleXOR: Trainable Fractional Quantization
Authors:
Dongsoo Lee,
Se Jung Kwon,
Byeongwook Kim,
Yongkweon Jeon,
Baeseong Park,
Jeongin Yun
Abstract:
Quantization based on the binary codes is gaining attention because each quantized bit can be directly utilized for computations without dequantization using look-up tables. Previous attempts, however, only allow for integer numbers of quantization bits, which ends up restricting the search space for compression ratio and accuracy. In this paper, we propose an encryption algorithm/architecture to…
▽ More
Quantization based on the binary codes is gaining attention because each quantized bit can be directly utilized for computations without dequantization using look-up tables. Previous attempts, however, only allow for integer numbers of quantization bits, which ends up restricting the search space for compression ratio and accuracy. In this paper, we propose an encryption algorithm/architecture to compress quantized weights so as to achieve fractional numbers of bits per weight. Decryption during inference is implemented by digital XOR-gate networks added into the neural network model while XOR gates are described by utilizing $\tanh(x)$ for backward propagation to enable gradient calculations. We perform experiments using MNIST, CIFAR-10, and ImageNet to show that inserting XOR gates learns quantization/encrypted bit decisions through training and obtains high accuracy even for fractional sub 1-bit weights. As a result, our proposed method yields smaller size and higher model accuracy compared to binary neural networks.
△ Less
Submitted 22 October, 2020; v1 submitted 9 September, 2020;
originally announced September 2020.
-
BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs
Authors:
Yongkweon Jeon,
Baeseong Park,
Se Jung Kwon,
Byeongwook Kim,
Jeongin Yun,
Dongsoo Lee
Abstract:
The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computations and required memory footprint increase as well. Quantization is an efficient method to address such concerns by compressing DNNs such that computations can be simplified while required storage footprint is significantl…
▽ More
The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computations and required memory footprint increase as well. Quantization is an efficient method to address such concerns by compressing DNNs such that computations can be simplified while required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized into a few bits, CPUs and GPUs cannot access multiple quantized weights without memory bandwidth waste. Success of quantization in practice, hence, relies on an efficient computation engine design, especially for matrix multiplication that is a basic computation engine in most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access multiple quantized weights simultaneously in one instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization leads to limited available computation space. Since pre-computed values are stored in lookup tables and reused, BiQGEMM achieves lower amount of overall computations. Our extensive experimental results show that BiQGEMM presents higher performance than conventional schemes when DNNs are quantized.
△ Less
Submitted 31 August, 2020; v1 submitted 20 May, 2020;
originally announced May 2020.
-
Learning Low-Rank Approximation for CNNs
Authors:
Dongsoo Lee,
Se Jung Kwon,
Byeongwook Kim,
Gu-Yeon Wei
Abstract:
Low-rank approximation is an effective model compression technique to not only reduce parameter storage requirements, but to also reduce computations. For convolutional neural networks (CNNs), however, well-known low-rank approximation methods, such as Tucker or CP decomposition, result in degraded model accuracy because decomposed layers hinder training convergence. In this paper, we propose a ne…
▽ More
Low-rank approximation is an effective model compression technique to not only reduce parameter storage requirements, but to also reduce computations. For convolutional neural networks (CNNs), however, well-known low-rank approximation methods, such as Tucker or CP decomposition, result in degraded model accuracy because decomposed layers hinder training convergence. In this paper, we propose a new training technique that finds a flat minimum in the view of low-rank approximation without a decomposed structure during training. By preserving the original model structure, 2-dimensional low-rank approximation demanding lowering (such as im2col) is available in our proposed scheme. We show that CNN models can be compressed by low-rank approximation with much higher compression ratio than conventional training methods while maintaining or even enhancing model accuracy. We also discuss various 2-dimensional low-rank approximation techniques for CNNs.
△ Less
Submitted 24 May, 2019;
originally announced May 2019.
-
Structured Compression by Weight Encryption for Unstructured Pruning and Quantization
Authors:
Se Jung Kwon,
Dongsoo Lee,
Byeongwook Kim,
Parichay Kapoor,
Baeseong Park,
Gu-Yeon Wei
Abstract:
Model compression techniques, such as pruning and quantization, are becoming increasingly important to reduce the memory footprints and the amount of computations. Despite model size reduction, achieving performance enhancement on devices is, however, still challenging mainly due to the irregular representations of sparse matrix formats. This paper proposes a new weight representation scheme for S…
▽ More
Model compression techniques, such as pruning and quantization, are becoming increasingly important to reduce the memory footprints and the amount of computations. Despite model size reduction, achieving performance enhancement on devices is, however, still challenging mainly due to the irregular representations of sparse matrix formats. This paper proposes a new weight representation scheme for Sparse Quantized Neural Networks, specifically achieved by fine-grained and unstructured pruning method. The representation is encrypted in a structured regular format, which can be efficiently decoded through XOR-gate network during inference in a parallel manner. We demonstrate various deep learning models that can be compressed and represented by our proposed format with fixed and high compression ratio. For example, for fully-connected layers of AlexNet on ImageNet dataset, we can represent the sparse weights by only 0.28 bits/weight for 1-bit quantization and 91% pruning rate with a fixed decoding rate and full memory bandwidth usage. Decoding through XOR-gate network can be performed without any model accuracy degradation with additional patch data associated with small overhead.
△ Less
Submitted 5 March, 2020; v1 submitted 24 May, 2019;
originally announced May 2019.
-
Network Pruning for Low-Rank Binary Indexing
Authors:
Dongsoo Lee,
Se Jung Kwon,
Byeongwook Kim,
Parichay Kapoor,
Gu-Yeon Wei
Abstract:
Pruning is an efficient model compression technique to remove redundancy in the connectivity of deep neural networks (DNNs). Computations using sparse matrices obtained by pruning parameters, however, exhibit vastly different parallelism depending on the index representation scheme. As a result, fine-grained pruning has not gained much attention due to its irregular index form leading to large mem…
▽ More
Pruning is an efficient model compression technique to remove redundancy in the connectivity of deep neural networks (DNNs). Computations using sparse matrices obtained by pruning parameters, however, exhibit vastly different parallelism depending on the index representation scheme. As a result, fine-grained pruning has not gained much attention due to its irregular index form leading to large memory footprint and low parallelism for convolutions and matrix multiplications. In this paper, we propose a new network pruning technique that generates a low-rank binary index matrix to compress index data while decompressing index data is performed by simple binary matrix multiplication. This proposed compression method finds a particular fine-grained pruning mask that can be decomposed into two binary matrices. We also propose a tile-based factorization technique that not only lowers memory requirements but also enhances compression ratio. Various DNN models can be pruned with much fewer indexes compared to previous sparse matrix formats while maintaining the same pruning rate.
△ Less
Submitted 14 May, 2019;
originally announced May 2019.