LLM Inference Acceleration via Efficient Operation Fusion

Authors: Mahsa Salmani, Ilya Soloveychik

Abstract: The rapid development of Transformer-based Large Language Models (LLMs) in recent years has been closely linked to their ever-growing and already enormous sizes. Many LLMs contain hundreds of billions of parameters and require dedicated hardware resources for training and inference. One of the key challenges inherent to the Transformer architecture is the requirement to support numerous non-linear transformations that involve normalization. For instance, each decoder block typically contains at least one Softmax operation and two Layernorms. The computation of the corresponding normalization scaling factors becomes a major bottleneck as it requires spatial collective operations. In other words, when it comes to the computation of denominators for Softmax and Layernorm, all vector elements must be aggregated into a single location, requiring significant communication. These collective operations slow down inference on Transformers by approximately 20%, defeating the whole purpose of distributed in-memory compute. In this work, we propose an extremely efficient technique that can completely hide the overhead caused by such collective operations. Note that each Softmax and Layernorm operation is typically followed by a linear layer. Since non-linear and linear operations are performed on different hardware engines, they can be easily parallelized once the algebra allows such commutation. By leveraging the inherent properties of linear operations, we can defer the normalization of the preceding Softmax and Layernorm until after the linear layer is computed. Now we can compute the collective scaling factors concurrently with the matrix multiplication and completely hide the latency of the former behind the latter. Such parallelization preserves the numerical accuracy while significantly improving the hardware utilization and reducing the overall latency.

Submitted 24 February, 2025; originally announced February 2025.
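
The deferral described in the abstract above rests on the linearity of the layer that follows Softmax: dividing by the denominator before or after the matrix product gives the same result. The following is a minimal pure-Python sketch of that algebra only. The function names are hypothetical, the max-subtraction stabilization used in practical kernels is omitted, and nothing here reflects the authors' actual hardware implementation.

```python
import math


def softmax_then_linear(x, W):
    """Baseline order: normalize first, then multiply by W (len(x) rows, n_out columns)."""
    e = [math.exp(v) for v in x]      # stabilization by max-subtraction omitted (assumption)
    s = sum(e)                        # collective reduction: the Softmax denominator
    p = [v / s for v in e]
    n_out = len(W[0])
    return [sum(p[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]


def linear_then_normalize(x, W):
    """Deferred order: multiply the unnormalized exponentials by W, divide afterwards."""
    e = [math.exp(v) for v in x]
    n_out = len(W[0])
    y = [sum(e[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]
    s = sum(e)                        # does not depend on y, so it could run concurrently
    return [v / s for v in y]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]
    W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
    print(softmax_then_linear(x, W))
    print(linear_then_normalize(x, W))
```

Both functions return the same vector up to floating-point rounding, because the denominator is a scalar that commutes with the matrix product.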

arXiv:2410.10553

SLaNC: Static LayerNorm Calibration

Authors: Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik

Abstract: The ever-increasing sizes of Large Language Models (LLMs) beyond hundreds of billions of parameters have generated enormous pressure on the manufacturers of dedicated hardware accelerators and made the innovative design of the latter one of the most rapidly expanding fields of the AI industry. Various approaches have been explored to enable efficient and accurate processing of LLMs on the available accelerators given their computational and storage limitations. Among these, various quantization techniques have become the main focus of the community as a means of reducing the compute, communication and storage requirements. Quantization to lower precision formats naturally poses a number of challenges caused by the limited range of the available value representations. When it comes to processing the popular Transformer models on hardware, one of the main issues becomes the calculation of the LayerNorm, simply because accumulation of the variance requires a much wider dynamic range than the hardware enables. In this article, we address this matter and propose a computationally efficient scaling technique that can be easily applied to Transformer models during inference. Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers. The scaling factors are computed offline, based solely on the linear layer weights, hence no latency or computational overhead is added during inference. Most importantly, our technique ensures that no numerical issues such as overflow or underflow can happen during the compute. This approach offers smooth, accurate and resource-effective inference across a wide range of hardware architectures. The article provides theoretical justification as well as supporting numerical simulations.

Submitted 14 October, 2024; originally announced October 2024.

Comments: 9 pages, 3 figures, NeurIPS 2024 MLNCP Workshop

arXiv:2405.07135

Post Training Quantization of Large Language Models with Microscaling Formats

Authors: Sayeh Sharify, Utkarsh Saxena, Zifei Xu, Wanzin Yazar, Ilya Soloveychik, Xin Wang

Abstract: Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of three well-known post-training techniques, SmoothQuant, AWQ, and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of these methods by enabling quantization to microscaling (MX) formats, extending the applicability of these PTQ algorithms beyond their original fixed-point format targets. We show that combining different PTQ methods enables us to quantize models to 4-bit weights and 8-bit activations using the MXINT format with negligible accuracy loss compared to the uncompressed baseline.

Submitted 15 October, 2024; v1 submitted 11 May, 2024; originally announced May 2024.

arXiv:2403.20137

Accurate Block Quantization in LLMs with Outliers

Authors: Nikita Trukhanov, Ilya Soloveychik

Abstract: The demand for inference on extremely large-scale LLMs has seen enormous growth in recent months. It made evident the colossal shortage of dedicated hardware capable of efficient and fast processing of the involved compute and memory movement. The problem is aggravated by the explosive rise in the lengths of the sequences being processed, since those require efficient on-chip storage of the KV-cache of size proportional to the sequence length. To make the required compute feasible and fit the involved data into available memory, numerous quantization techniques have been proposed that allow accurate quantization for both weights and activations. One of the main recent breakthroughs in this direction was the introduction of the family of Block Floating Point (BFP) formats characterized by a block of mantissas with a shared scale factor. These enable memory-, power-, and compute-efficient hardware support of the tensor operations and provide extremely good quantization accuracy. The main issue preventing widespread application of block formats is the presence of outliers in weights and activations, since those affect the accuracy of the other values in the same block. In this paper, we focus on the most critical problem of limited KV-cache storage. We propose a novel approach enabling usage of low precision BFP formats without compromising the resulting model accuracy. We exploit the common channel-wise patterns exhibited by the outliers to rearrange them in such a way that their quantization quality is significantly improved. The methodology yields 2x savings in the memory footprint without significant degradation of the model's accuracy. Importantly, the rearrangement of channels happens at compile time and thus has no impact on the inference latency.

Submitted 29 March, 2024; originally announced March 2024.

arXiv:2403.09054

Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Authors: Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath

Abstract: Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.

Submitted 5 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

MSC Class: 68U35

ACM Class: I.2.7; C.0

Journal ref: Proceedings of the 7th Annual Conference on Machine Learning and Systems (MLSys), 2024

arXiv:2308.10119

Error Probability Bounds for Invariant Causal Prediction via Multiple Access Channels

Authors: Austin Goddard, Yu Xiang, Ilya Soloveychik

Abstract: We consider the problem of lower bounding the error probability under the invariant causal prediction (ICP) framework. To this end, we examine and draw connections between ICP and the zero-rate Gaussian multiple access channel by first proposing a variant of the original invariant prediction assumption, and then considering a special case of the Gaussian multiple access channel where a codebook is shared between an unknown number of senders. This connection allows us to develop three types of lower bounds on the error probability, each with different assumptions and constraints, leveraging techniques for multiple access channels. The proposed bounds are evaluated with respect to existing causal discovery methods as well as a proposed heuristic method based on minimum distance decoding.

Submitted 19 August, 2023; originally announced August 2023.

Comments: Accepted to the 2023 Asilomar Conference on Signals, Systems, and Computers

arXiv:2210.05470

Block Format Error Bounds and Optimal Block Size Selection

Authors: Ilya Soloveychik, Ilya Lyubomirsky, Xin Wang, Sudeep Bhoja

Abstract: The amounts of data that need to be transmitted, processed, and stored by modern deep neural networks have reached truly enormous volumes in the last few years, calling for the invention of new paradigms both in hardware and software development. One of the most promising and rapidly advancing frontiers here is the creation of new numerical formats. In this work we focus on the family of block floating point numerical formats due to their combination of wide dynamic range, numerical accuracy, and efficient hardware implementation of inner products using simple integer arithmetic. These formats are characterized by a block of mantissas with a shared scale factor. The basic Block Floating Point (BFP) format quantizes the block scales into the nearest powers of two on the right. Its simple modification, Scaled BFP (SBFP), stores the same scales in full precision and thus allows higher accuracy. In this paper, we study the statistical behavior of both these formats rigorously. We develop asymptotic bounds on the inner product error in SBFP- and BFP-quantized normally distributed vectors. Next, we refine those asymptotic results to finite dimensional settings and derive high-dimensional tight bounds for the same errors. Based on the obtained results we introduce a performance measure assessing accuracy of any block format. This measure allows us to determine the optimal parameters, such as the block size, yielding the highest accuracy. In particular, we show that if the precision of the BFP format is fixed at 4 bits, the optimal block size becomes 64. All theoretical derivations are supported by numerical experiments and studies on the weights of publicly available pretrained neural networks.

Submitted 7 November, 2022; v1 submitted 11 October, 2022; originally announced October 2022.
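
The difference between the two formats named in the abstract above (a shared scale rounded up to a power of two for BFP, versus a full-precision shared scale for SBFP) can be illustrated with a small pure-Python sketch. Everything beyond that distinction is an assumption made for illustration: the function names are hypothetical, mantissas are taken to be symmetric signed integers, and the scale is derived from the block's peak magnitude. The paper's exact conventions may differ.

```python
import math


def quantize_block(block, bits=4, power_of_two_scale=True):
    """Quantize one block to signed integer mantissas with a single shared scale.

    power_of_two_scale=True  -> BFP-like: scale rounded up to a power of two
    power_of_two_scale=False -> SBFP-like: scale kept in full precision
    """
    qmax = 2 ** (bits - 1) - 1                  # symmetric mantissa range (assumption)
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return [0] * len(block), 1.0
    scale = peak / qmax                         # smallest scale that avoids clipping
    if power_of_two_scale:
        scale = 2.0 ** math.ceil(math.log2(scale))
    mantissas = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return mantissas, scale


def dequantize_block(mantissas, scale):
    """Reconstruct approximate values from mantissas and the shared scale."""
    return [m * scale for m in mantissas]


if __name__ == "__main__":
    block = [0.1, -0.7, 0.35, 0.02]
    for p2 in (True, False):
        m, s = quantize_block(block, bits=4, power_of_two_scale=p2)
        print(p2, s, m, dequantize_block(m, s))
```

In this sketch the power-of-two scale is never smaller than the full-precision one, so its quantization step is coarser, which is consistent with the abstract's remark that SBFP allows higher accuracy.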

arXiv:2206.14362

Lower Bounds on the Error Probability for Invariant Causal Prediction

Authors: Austin Goddard, Yu Xiang, Ilya Soloveychik

Abstract: It is common practice to collect observations of feature and response pairs from different environments. A natural question is how to identify features that have consistent prediction power across environments. The invariant causal prediction framework proposes to approach this problem through invariance, assuming a linear model that is invariant under different environments. In this work, we make an attempt to shed light on this framework by connecting it to the Gaussian multiple access channel problem. Specifically, we incorporate optimal code constructions and decoding methods to provide lower bounds on the error probability. We illustrate our findings in various simulation settings.

Submitted 29 June, 2022; v1 submitted 28 June, 2022; originally announced June 2022.

Comments: Accepted to the 2022 IEEE International Workshop on Machine Learning for Signal Processing (MLSP)

arXiv:2006.03311

A Robust Test for Elliptical Symmetry

Authors: Ilya Soloveychik

Abstract: Most signal processing and statistical applications heavily rely on specific data distribution models. The Gaussian distributions, although the most common choice, are inadequate in most real-world scenarios as they fail to account for data coming from heavy-tailed populations or contaminated by outliers. Such problems call for the use of Robust Statistics. The robust models and estimators are usually based on elliptical populations, making the latter ubiquitous in all methods of robust statistics. To determine whether such tools are applicable in any specific case, goodness-of-fit (GoF) tests are used to verify the ellipticity hypothesis. Ellipticity GoF tests are usually hard to analyze and often their statistical power is not particularly strong. In this work, assuming the true covariance matrix is unknown, we design and rigorously analyze a robust GoF test consistent against all alternatives to ellipticity on the unit sphere. The proposed test is based on Tyler's estimator and is formulated in terms of easily computable statistics of the data. For its rigorous analysis, we develop a novel framework based on the exchangeable random variables calculus introduced by de Finetti. Our findings are supported by numerical simulations comparing them to other popular GoF tests and demonstrating the significantly higher statistical power of the suggested technique.

Submitted 14 April, 2023; v1 submitted 5 June, 2020; originally announced June 2020.

arXiv:1806.03571

Stationary Geometric Graphical Model Selection

Authors: Ilya Soloveychik, Vahid Tarokh

Abstract: We consider the problem of model selection in Gaussian Markov fields in the sample-deficient scenario. In many practically important cases, the underlying networks are embedded into Euclidean spaces. Using the natural geometric structure, we introduce the notion of spatially stationary distributions over geometric graphs. This directly generalizes the notion of stationary time series to the multidimensional setting lacking a time axis. We show that the idea of spatial stationarity leads to a dramatic decrease in the sample complexity of the model selection compared to abstract graphs with the same level of sparsity. For geometric graphs on randomly spread vertices and edges of bounded length, we develop tight information-theoretic bounds on sample complexity and show that a finite number of independent samples is sufficient for a consistent recovery. Finally, we develop an efficient technique capable of reliably and consistently reconstructing graphs with a bounded number of measurements.

Submitted 29 October, 2018; v1 submitted 9 June, 2018; originally announced June 2018.

Comments: arXiv admin note: text overlap with arXiv:1802.03848

arXiv:1701.05544

Pseudo-Wigner Matrices

Authors: Ilya Soloveychik, Yu Xiang, Vahid Tarokh

Abstract: We consider the problem of generating pseudo-random matrices based on the similarity of their spectra to Wigner's semicircular law. We introduce the notion of an r-independent pseudo-Wigner matrix ensemble and prove closeness of the spectra of its matrices to the semicircular density in the Kolmogorov distance. We give an explicit construction of a family of N by N pseudo-Wigner ensembles using dual BCH codes and show that the Kolmogorov complexity of the obtained matrices is of the order of log(N) bits for a fixed designed Kolmogorov distance precision. We compare our construction to the quasi-random graphs introduced by Chung, Graham and Wilson and demonstrate that the pseudo-Wigner matrices pass stronger randomness tests than the adjacency matrices of these graphs (lifted by the mapping 0 -> 1 and 1 -> -1) do. Finally, we provide numerical simulations verifying our theoretical results.

Submitted 26 February, 2018; v1 submitted 19 January, 2017; originally announced January 2017.

Showing 1–11 of 11 results for author: Soloveychik, I