Showing 1–26 of 26 results for author: Kwon, S J

  1. arXiv:2506.03781  [pdf, ps, other]

    cs.CL

    Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models

    Authors: Seungcheol Park, Jeongin Bae, Beomseok Kwon, Minjun Kim, Byeongwook Kim, Se Jung Kwon, U Kang, Dongsoo Lee

    Abstract: How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF…

    Submitted 16 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

    Comments: ACL 2025 Main Track

    MSC Class: 68T50 ACM Class: I.2.7

  2. arXiv:2503.09975  [pdf, ps, other]

    cs.AR

    Faster Inference of LLMs using FP8 on the Intel Gaudi

    Authors: Joonhyung Lee, Shmulik Markovich-Golan, Daniel Ohayon, Yair Hanani, Gunho Park, Byeongwook Kim, Asaf Karnieli, Uri Livne, Haihao Shen, Tai Huang, Se Jung Kwon, Dongsoo Lee

    Abstract: Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evalu…

    Submitted 16 March, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

  3. arXiv:2502.01070  [pdf, other]

    cs.LG cs.PF

    An Inquiry into Datacenter TCO for LLM Inference with FP8

    Authors: Jiwoo Kim, Joonhyung Lee, Gunho Park, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee, Youngjoo Lee

    Abstract: As large language models (LLMs) continue to scale, their inference demands present significant challenges, particularly due to the high power consumption of AI accelerators in datacenters. These facilities require specialized cooling and power management systems, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs). In this work, we analyze the computationa…

    Submitted 29 April, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

  4. arXiv:2501.00210  [pdf, other]

    cs.DC cs.AI cs.AR

    Debunking the CUDA Myth Towards GPU-based AI Systems

    Authors: Yunjae Lee, Juntaek Lim, Jehyeon Bang, Eunyeong Cho, Huijong Jeong, Taesu Kim, Hyungjun Kim, Joonhyung Lee, Jinseop Im, Ranggi Hwang, Se Jung Kwon, Dongsoo Lee, Minsoo Rhu

    Abstract: This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs, which is currently the de facto standard in AI system design. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing sev…

    Submitted 21 March, 2025; v1 submitted 30 December, 2024; originally announced January 2025.

    Comments: Accepted for publication at the 52nd IEEE/ACM International Symposium on Computer Architecture (ISCA-52), 2025

  5. arXiv:2407.11534  [pdf, other]

    cs.LG cs.AI

    LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

    Authors: Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee

    Abstract: With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language underst…

    Submitted 8 February, 2025; v1 submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted to the main conference at NAACL 2025

  6. arXiv:2405.18710  [pdf, other]

    cs.LG cs.AI

    To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability

    Authors: Joonhyung Lee, Jeongin Bae, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

    Abstract: The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent generations of accelerators. This trend has gone even further in th…

    Submitted 25 March, 2025; v1 submitted 28 May, 2024; originally announced May 2024.

  7. arXiv:2404.01954  [pdf, other]

    cs.CL cs.AI

    HyperCLOVA X Technical Report

    Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han, et al. (371 additional authors not shown)

    Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…

    Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: 44 pages; updated authors list and fixed author names

  8. arXiv:2402.18096  [pdf, other]

    cs.LG cs.AI

    No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

    Authors: June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee

    Abstract: Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models (LLMs). However, the memory footprint of the KV cache poses a critical bottleneck in LLM deployment as the cache size grows with batch size and sequence length, often surpassing even the size of the model itself. Although recent methods were proposed to s…

    Submitted 28 February, 2024; originally announced February 2024.

  9. arXiv:2402.17812  [pdf, other]

    cs.LG cs.CL

    DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation

    Authors: Sunghyeon Woo, Baeseong Park, Byeongwook Kim, Minjung Jo, Se Jung Kwon, Dongsuk Jeon, Dongsoo Lee

    Abstract: Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically involves substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and ac…

    Submitted 28 February, 2025; v1 submitted 27 February, 2024; originally announced February 2024.

  10. arXiv:2402.17517  [pdf, other]

    cs.LG

    Label-Noise Robust Diffusion Models

    Authors: Byeonghu Na, Yeongmin Kim, HeeSun Bae, Jung Hyun Lee, Se Jung Kwon, Wanmo Kang, Il-Chul Moon

    Abstract: Conditional diffusion models have shown remarkable performance in various generative tasks, but training them requires large-scale datasets that often contain noise in conditional inputs, a.k.a. noisy labels. This noise leads to condition mismatch and quality degradation of generated data. This paper proposes Transition-aware weighted Denoising Score Matching (TDSM) for training conditional diffus…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted at ICLR 2024

  11. arXiv:2309.15531  [pdf, other]

    cs.LG

    Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models

    Authors: Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

    Abstract: Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to the large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outlier…

    Submitted 13 April, 2025; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: ICLR 2024

  12. arXiv:2306.00317  [pdf, other]

    cs.LG cs.AI

    FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization

    Authors: Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee

    Abstract: Post-training quantization (PTQ) has been gaining popularity for the deployment of deep neural networks on resource-limited devices since unlike quantization-aware training, neither a full training dataset nor end-to-end training is required at all. As PTQ schemes based on reconstructing each layer or block output turn out to be effective to enhance quantized model performance, recent works have d…

    Submitted 16 July, 2024; v1 submitted 31 May, 2023; originally announced June 2023.

    Comments: Accepted to ICML 2023

  13. arXiv:2305.14152  [pdf, other]

    cs.LG cs.AI

    Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

    Authors: Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee

    Abstract: Large language models (LLMs) face the challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights continues to be a pressing concern. Even though quantization techniques are widely proposed…

    Submitted 28 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Published at NeurIPS 2023. Camera-ready version

  14. arXiv:2211.17091  [pdf, other]

    cs.CV cs.AI cs.LG

    Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models

    Authors: Dongjun Kim, Yeongmin Kim, Se Jung Kwon, Wanmo Kang, Il-Chul Moon

    Abstract: The proposed method, Discriminator Guidance, aims to improve sample generation of pre-trained diffusion models. The approach introduces a discriminator that gives explicit supervision to a denoising sample path whether it is realistic or not. Unlike GANs, our approach does not require joint training of score and discriminator networks. Instead, we train the discriminator after score training, maki…

    Submitted 4 June, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: International Conference on Machine Learning (ICML23)

  15. arXiv:2210.03858  [pdf, other]

    cs.LG cs.CL

    AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

    Authors: Se Jung Kwon, Jeonghoon Kim, Jeongin Bae, Kang Min Yoo, Jin-Hwa Kim, Baeseong Park, Byeongwook Kim, Jung-Woo Ha, Nako Sung, Dongsoo Lee

    Abstract: There are growing interests in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression has not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving co…

    Submitted 7 October, 2022; originally announced October 2022.

    Comments: Findings of EMNLP 2022

  16. arXiv:2206.09557  [pdf, ps, other]

    cs.DC cs.CL

    LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

    Authors: Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee

    Abstract: Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity. However, the growing size of NLP models introduces a memory wall problem during the generation phase. To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving f…

    Submitted 1 April, 2024; v1 submitted 19 June, 2022; originally announced June 2022.

    Comments: ICLR 2024

  17. arXiv:2205.13699  [pdf, other]

    cs.LG

    Maximum Likelihood Training of Implicit Nonlinear Diffusion Models

    Authors: Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, Il-Chul Moon

    Abstract: Whereas diverse variations of diffusion models exist, extending the linear diffusion into a nonlinear diffusion process is investigated by very few works. The nonlinearity effect has been hardly understood, but intuitively, there would be promising diffusion patterns to efficiently train the generative distribution towards the data distribution. This paper introduces a data-adaptive nonlinear diff…

    Submitted 12 October, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

    Journal ref: Advances in Neural Information Processing Systems 2022 (NeurIPS22)

  18. arXiv:2105.01875  [pdf, ps, other]

    cs.LG cs.AI

    Modulating Regularization Frequency for Efficient Compression-Aware Model Training

    Authors: Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Baeseong Park, Yongkweon Jeon

    Abstract: While model compression is increasingly important because of large neural network size, compression-aware training is challenging as it needs sophisticated model modifications and longer training time. In this paper, we introduce regularization frequency (i.e., how often compression is performed during training) as a new regularization technique for a practical and efficient compression-aware train…

    Submitted 5 May, 2021; originally announced May 2021.

    Comments: arXiv admin note: text overlap with arXiv:1905.10145

  19. arXiv:2105.01869  [pdf, other]

    cs.LG cs.IT

    Encoding Weights of Irregular Sparsity for Fixed-to-Fixed Model Compression

    Authors: Baeseong Park, Se Jung Kwon, Daehwan Oh, Byeongwook Kim, Dongsoo Lee

    Abstract: Even though fine-grained pruning techniques achieve a high compression ratio, conventional sparsity representations (such as CSR) associated with irregular sparsity degrade parallelism significantly. Practical pruning methods, thus, usually lower pruning rates (by structured pruning) to improve parallelism. In this paper, we study fixed-to-fixed (lossless) encoding architecture/algorithm to suppor…

    Submitted 30 January, 2022; v1 submitted 5 May, 2021; originally announced May 2021.

    Comments: ICLR 2022 Accepted

  20. arXiv:2105.01868  [pdf, ps, other]

    cs.LG math.OC

    Q-Rater: Non-Convex Optimization for Post-Training Uniform Quantization

    Authors: Byeongwook Kim, Dongsoo Lee, Yeonju Ro, Yongkweon Jeon, Se Jung Kwon, Baeseong Park, Daehwan Oh

    Abstract: Various post-training uniform quantization methods have usually been studied based on convex optimization. As a result, most previous ones rely on the quantization error minimization and/or quadratic approximations. Such approaches are computationally efficient and reasonable when a large number of quantization bits are employed. When the number of quantization bits is relatively low, however, non…

    Submitted 5 May, 2021; originally announced May 2021.

  21. arXiv:2009.07453  [pdf, ps, other]

    cs.LG cs.CL stat.ML

    Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation

    Authors: Insoo Chung, Byeongwook Kim, Yoonjung Choi, Se Jung Kwon, Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee

    Abstract: The deployment of widely used Transformer architecture is challenging because of heavy computation load and memory overhead during inference, especially when the target device is limited in computational resources such as mobile or edge devices. Quantization is an effective technique to address such challenges. Our analysis shows that for a given number of quantization bits, each block of Transfor…

    Submitted 13 October, 2020; v1 submitted 15 September, 2020; originally announced September 2020.

    Comments: Findings of EMNLP 2020

  22. arXiv:2009.04126  [pdf, ps, other]

    cs.LG stat.ML

    FleXOR: Trainable Fractional Quantization

    Authors: Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun

    Abstract: Quantization based on the binary codes is gaining attention because each quantized bit can be directly utilized for computations without dequantization using look-up tables. Previous attempts, however, only allow for integer numbers of quantization bits, which ends up restricting the search space for compression ratio and accuracy. In this paper, we propose an encryption algorithm/architecture to…

    Submitted 22 October, 2020; v1 submitted 9 September, 2020; originally announced September 2020.

    Comments: NeurIPS 2020 Accepted

  23. arXiv:2005.09904  [pdf, ps, other]

    cs.LG stat.ML

    BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs

    Authors: Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee

    Abstract: The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computations and required memory footprint increase as well. Quantization is an efficient method to address such concerns by compressing DNNs such that computations can be simplified while required storage footprint is significantl…

    Submitted 31 August, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: 13 pages, 12 figures

  24. arXiv:1905.10145  [pdf, ps, other]

    cs.LG stat.ML

    Learning Low-Rank Approximation for CNNs

    Authors: Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Gu-Yeon Wei

    Abstract: Low-rank approximation is an effective model compression technique to not only reduce parameter storage requirements, but to also reduce computations. For convolutional neural networks (CNNs), however, well-known low-rank approximation methods, such as Tucker or CP decomposition, result in degraded model accuracy because decomposed layers hinder training convergence. In this paper, we propose a ne…

    Submitted 24 May, 2019; originally announced May 2019.

  25. arXiv:1905.10138  [pdf, ps, other]

    cs.LG stat.ML

    Structured Compression by Weight Encryption for Unstructured Pruning and Quantization

    Authors: Se Jung Kwon, Dongsoo Lee, Byeongwook Kim, Parichay Kapoor, Baeseong Park, Gu-Yeon Wei

    Abstract: Model compression techniques, such as pruning and quantization, are becoming increasingly important to reduce the memory footprints and the amount of computations. Despite model size reduction, achieving performance enhancement on devices is, however, still challenging mainly due to the irregular representations of sparse matrix formats. This paper proposes a new weight representation scheme for S…

    Submitted 5 March, 2020; v1 submitted 24 May, 2019; originally announced May 2019.

  26. arXiv:1905.05686  [pdf, ps, other]

    cs.LG stat.ML

    Network Pruning for Low-Rank Binary Indexing

    Authors: Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Parichay Kapoor, Gu-Yeon Wei

    Abstract: Pruning is an efficient model compression technique to remove redundancy in the connectivity of deep neural networks (DNNs). Computations using sparse matrices obtained by pruning parameters, however, exhibit vastly different parallelism depending on the index representation scheme. As a result, fine-grained pruning has not gained much attention due to its irregular index form leading to large mem…

    Submitted 14 May, 2019; originally announced May 2019.