-
QuantEase: Optimization-based Quantization for Language Models
Authors:
Kayhan Behdin,
Ayan Acharya,
Aman Gupta,
Qingquan Song,
Siyu Zhu,
Sathiya Keerthi,
Rahul Mazumder
Abstract:
With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in compression techniques that enable their efficient deployment. This study focuses on the Post-Training Quantization (PTQ) of LLMs. Drawing from recent advances, our work introduces QuantEase, a layer-wise quantization framework where individual layers undergo separate quantization. The problem is framed as a discrete-structured non-convex optimization, prompting the development of algorithms rooted in Coordinate Descent (CD) techniques. These CD-based methods provide high-quality solutions to the complex non-convex layer-wise quantization problems. Notably, our CD-based approach features straightforward updates, relying solely on matrix and vector operations, circumventing the need for matrix inversion or decomposition. We also explore an outlier-aware variant of our approach, allowing for retaining significant weights (outliers) with complete precision. Our proposal attains state-of-the-art performance in terms of perplexity and zero-shot accuracy in empirical evaluations across various LLMs and datasets, with relative improvements up to 15% over methods such as GPTQ. Leveraging careful linear algebra optimizations, QuantEase can quantize models like Falcon-180B on a single NVIDIA A100 GPU in $\sim$3 hours. Particularly noteworthy is our outlier-aware algorithm's capability to achieve near or sub-3-bit quantization of LLMs with an acceptable drop in accuracy, obviating the need for non-uniform quantization or grouping techniques, improving upon methods such as SpQR by up to two times in terms of perplexity.
Submitted 1 December, 2023; v1 submitted 4 September, 2023;
originally announced September 2023.
-
mSAM: Micro-Batch-Averaged Sharpness-Aware Minimization
Authors:
Kayhan Behdin,
Qingquan Song,
Aman Gupta,
Sathiya Keerthi,
Ayan Acharya,
Borja Ocejo,
Gregory Dexter,
Rajiv Khanna,
David Durfee,
Rahul Mazumder
Abstract:
Modern deep learning models are over-parameterized, where different optima can result in widely varying generalization performance. The Sharpness-Aware Minimization (SAM) technique modifies the fundamental loss function that steers gradient descent methods toward flatter minima, which are believed to exhibit enhanced generalization prowess. Our study delves into a specific variant of SAM known as micro-batch SAM (mSAM). This variation involves aggregating updates derived from adversarial perturbations across multiple shards (micro-batches) of a mini-batch during training. We extend a recently developed and well-studied general framework for flatness analysis to theoretically show that SAM achieves flatter minima than SGD, and mSAM achieves even flatter minima than SAM. We provide a thorough empirical evaluation of various image classification and natural language processing tasks to substantiate this theoretical advancement. We also show that contrary to previous work, mSAM can be implemented in a flexible and parallelizable manner without significantly increasing computational costs. Our implementation of mSAM yields superior generalization performance across a wide range of tasks compared to SAM, further supporting our theoretical framework.
Submitted 30 September, 2023; v1 submitted 19 February, 2023;
originally announced February 2023.
-
Employing Feature Selection Algorithms to Determine the Immune State of a Mouse Model of Rheumatoid Arthritis
Authors:
Brendon K. Colbert,
Joslyn L. Mangal,
Aleksandr Talitckii,
Abhinav P. Acharya,
Matthew M. Peet
Abstract:
The immune response is a dynamic process by which the body determines whether an antigen is self or nonself. The state of this dynamic process is defined by the relative balance and population of inflammatory and regulatory actors which comprise this decision making process. The goal of immunotherapy as applied to, e.g. Rheumatoid Arthritis (RA), then, is to bias the immune state in favor of the regulatory actors - thereby shutting down autoimmune pathways in the response. While there are several known approaches to immunotherapy, the effectiveness of the therapy will depend on how this intervention alters the evolution of this state. Unfortunately, this process is determined not only by the dynamics of the process, but the state of the system at the time of intervention - a state which is difficult if not impossible to determine prior to application of the therapy. To identify such states we consider a mouse model of RA (Collagen-Induced Arthritis (CIA)) immunotherapy; collect high dimensional data on T cell markers and populations of mice after treatment with a recently developed immunotherapy for CIA; and use feature selection algorithms in order to select a lower dimensional subset of this data which can be used to predict both the full set of T cell markers and populations, along with the efficacy of immunotherapy treatment.
Submitted 21 October, 2023; v1 submitted 12 July, 2022;
originally announced July 2022.
-
Robust Training in High Dimensions via Block Coordinate Geometric Median Descent
Authors:
Anish Acharya,
Abolfazl Hashemi,
Prateek Jain,
Sujay Sanghavi,
Inderjit S. Dhillon,
Ufuk Topcu
Abstract:
Geometric median (\textsc{Gm}) is a classical method in statistics for achieving a robust estimation of the uncorrupted data; under gross corruption, it achieves the optimal breakdown point of 0.5. However, its computational complexity makes it infeasible for robustifying stochastic gradient descent (SGD) for high-dimensional optimization problems. In this paper, we show that by applying \textsc{Gm} to only a judiciously chosen block of coordinates at a time and using a memory mechanism, one can retain the breakdown point of 0.5 for smooth non-convex problems, with non-asymptotic convergence rates comparable to the SGD with \textsc{Gm}.
Submitted 16 June, 2021;
originally announced June 2021.
-
Faster Non-Convex Federated Learning via Global and Local Momentum
Authors:
Rudrajit Das,
Anish Acharya,
Abolfazl Hashemi,
Sujay Sanghavi,
Inderjit S. Dhillon,
Ufuk Topcu
Abstract:
We propose \texttt{FedGLOMO}, a novel federated learning (FL) algorithm with an iteration complexity of $\mathcal{O}(ε^{-1.5})$ to converge to an $ε$-stationary point (i.e., $\mathbb{E}[\|\nabla f(\bm{x})\|^2] \leq ε$) for smooth non-convex functions -- under arbitrary client heterogeneity and compressed communication -- compared to the $\mathcal{O}(ε^{-2})$ complexity of most prior works. Our key algorithmic idea that enables achieving this improved complexity is based on the observation that the convergence in FL is hampered by two sources of high variance: (i) the global server aggregation step with multiple local updates, exacerbated by client heterogeneity, and (ii) the noise of the local client-level stochastic gradients. By modeling the server aggregation step as a generalized gradient-type update, we propose a variance-reducing momentum-based global update at the server, which when applied in conjunction with variance-reduced local updates at the clients, enables \texttt{FedGLOMO} to enjoy an improved convergence rate. Moreover, we derive our results under a novel and more realistic client-heterogeneity assumption which we verify empirically -- unlike prior assumptions that are hard to verify. Our experiments illustrate the intrinsic variance reduction effect of \texttt{FedGLOMO}, which implicitly suppresses client-drift in heterogeneous data distribution settings and promotes communication efficiency.
Submitted 24 October, 2021; v1 submitted 7 December, 2020;
originally announced December 2020.
-
On the Benefits of Multiple Gossip Steps in Communication-Constrained Decentralized Optimization
Authors:
Abolfazl Hashemi,
Anish Acharya,
Rudrajit Das,
Haris Vikalo,
Sujay Sanghavi,
Inderjit Dhillon
Abstract:
In decentralized optimization, it is common algorithmic practice to have nodes interleave (local) gradient descent iterations with gossip (i.e. averaging over the network) steps. Motivated by the training of large-scale machine learning models, it is also increasingly common to require that messages be {\em lossy compressed} versions of the local parameters. In this paper, we show that, in such compressed decentralized optimization settings, there are benefits to having {\em multiple} gossip steps between subsequent gradient iterations, even when the cost of doing so is appropriately accounted for, e.g. by means of reducing the precision of compressed information. In particular, we show that having $O(\log\frac{1}ε)$ gradient iterations with constant step size - and $O(\log\frac{1}ε)$ gossip steps between every pair of these iterations - enables convergence to within $ε$ of the optimal value for smooth non-convex objectives satisfying the Polyak-Łojasiewicz condition. This result also holds for smooth strongly convex objectives. To our knowledge, this is the first work that derives convergence results for non-convex optimization under arbitrary communication compression.
Submitted 20 November, 2020;
originally announced November 2020.
-
Isometric Graph Neural Networks
Authors:
Matthew Walker,
Bo Yan,
Yiou Xiao,
Yafei Wang,
Ayan Acharya
Abstract:
Many tasks that rely on representations of nodes in graphs would benefit if those representations were faithful to distances between nodes in the graph. Geometric techniques to extract such representations have poor scaling over large graph size, and recent advances in Graph Neural Network (GNN) algorithms have limited ability to reflect graph distance information beyond the first degree neighborhood. To enable this highly desired capability, we propose a technique to learn Isometric Graph Neural Networks (IGNN), which requires changing the input representation space and loss function to enable any GNN algorithm to generate representations that reflect distances between nodes. We experiment with the isometric technique on several GNN architectures for modeling multiple prediction tasks on multiple datasets. In addition to an improvement in AUC-ROC as high as $43\%$ in these experiments, we observe a consistent and substantial improvement as high as 400% in Kendall's Tau (KT), a measure that directly reflects distance information, demonstrating that the learned embeddings do account for graph distances.
Submitted 16 June, 2020;
originally announced June 2020.
-
Combining Outcome-Based and Preference-Based Matching: A Constrained Priority Mechanism
Authors:
Avidit Acharya,
Kirk Bansak,
Jens Hainmueller
Abstract:
We introduce a constrained priority mechanism that combines outcome-based matching from machine-learning with preference-based allocation schemes common in market design. Using real-world data, we illustrate how our mechanism could be applied to the assignment of refugee families to host country locations, and kindergarteners to schools. Our mechanism allows a planner to first specify a threshold $\bar g$ for the minimum acceptable average outcome score that should be achieved by the assignment. In the refugee matching context, this score corresponds to the predicted probability of employment, while in the student assignment context it corresponds to standardized test scores. The mechanism is a priority mechanism that considers both outcomes and preferences by assigning agents (refugee families, students) based on their preferences, but subject to meeting the planner's specified threshold. The mechanism is both strategy-proof and constrained efficient in that it always generates a matching that is not Pareto dominated by any other matching that respects the planner's threshold.
Submitted 11 August, 2020; v1 submitted 19 February, 2019;
originally announced February 2019.
-
Online Embedding Compression for Text Classification using Low Rank Matrix Factorization
Authors:
Anish Acharya,
Rahul Goel,
Angeliki Metallinou,
Inderjit Dhillon
Abstract:
Deep learning models have become state of the art for natural language processing (NLP) tasks; however, deploying these models in production systems poses significant memory constraints. Existing compression methods are either lossy or introduce significant latency. We propose a compression method that leverages low rank matrix factorization during training to compress the word embedding layer, which represents the size bottleneck for most NLP models. Our models are trained, compressed and then further re-trained on the downstream task to recover accuracy while maintaining the reduced size. Empirically, we show that the proposed method can achieve 90% compression with minimal impact on accuracy for sentence classification tasks, and outperforms alternative methods like fixed-point quantization or offline word embedding compression. We also analyze the inference time and storage space for our method through FLOP calculations, showing that we can compress DNN models by a configurable ratio and regain accuracy loss without introducing additional latency compared to fixed point quantization. Finally, we introduce a novel learning rate schedule, the Cyclically Annealed Learning Rate (CALR), which we empirically demonstrate to outperform other popular adaptive learning rate algorithms on a sentence classification benchmark.
Submitted 1 November, 2018;
originally announced November 2018.
-
Minimax estimation of qubit states with Bures risk
Authors:
Anirudh Acharya,
Madalin Guta
Abstract:
The central problem of quantum statistics is to devise measurement schemes for the estimation of an unknown state, given an ensemble of $n$ independent identically prepared systems. For locally quadratic loss functions, the risk of standard procedures has the usual scaling of $1/n$. However, it has been noticed that for fidelity based metrics such as the Bures distance, the risk of conventional (non-adaptive) qubit tomography schemes scales as $1/\sqrt{n}$ for states close to the boundary of the Bloch sphere. Several proposed estimators appear to improve this scaling, and our goal is to analyse the problem from the perspective of the maximum risk over all states.
We propose qubit estimation strategies based on separate and adaptive measurements that achieve $1/n$ scalings for the maximum Bures risk. The estimator involving local measurements uses a fixed fraction of the available resource $n$ to estimate the Bloch vector direction; the length of the Bloch vector is then estimated from the remaining copies by measuring in the estimator eigenbasis. The estimator based on collective measurements uses local asymptotic normality techniques which allow us to derive upper and lower bounds to its maximum Bures risk. We also discuss how to construct a minimax optimal estimator in this setup. Finally, we consider quantum relative entropy and show that the risk of the estimator based on collective measurements achieves a rate $O(n^{-1}\log n)$ under this loss function. Furthermore, we show that no estimator can achieve faster rates, in particular the `standard' rate $1/n$.
Submitted 25 September, 2017; v1 submitted 16 August, 2017;
originally announced August 2017.
-
Statistical analysis of low rank tomography with compressive random measurements
Authors:
Anirudh Acharya,
Madalin Guta
Abstract:
We consider the statistical problem of `compressive' estimation of low rank states with random basis measurements, where the estimation error is expressed in terms of two metrics - the Frobenius norm and quantum infidelity. It is known that unlike the case of general full state tomography, low rank states can be identified from a reduced number of observables' expectations. Here we investigate whether for a fixed sample size $N$, the estimation error associated to a `compressive' measurement setup is `close' to that of the setting where a large number of bases are measured. In terms of the Frobenius norm, we demonstrate that for all states the error attains the optimal rate $rd/N$ with only $O(r \log{d})$ random basis measurements. We provide an illustrative example of a single qubit and demonstrate a concentration in the Frobenius error about its optimal for all qubit states. In terms of the quantum infidelity, we show that such a concentration does not exist uniformly over all states. Specifically, we show that for states that are nearly pure and close to the surface of the Bloch sphere, the mean infidelity scales as $1/\sqrt{N}$ but the constant converges to zero as the number of settings is increased. This demonstrates a lack of `compressive' recovery for nearly pure states in this metric.
Submitted 13 September, 2016;
originally announced September 2016.
-
Nonparametric Bayesian Factor Analysis for Dynamic Count Matrices
Authors:
Ayan Acharya,
Joydeep Ghosh,
Mingyuan Zhou
Abstract:
A gamma process dynamic Poisson factor analysis model is proposed to factorize a dynamic count matrix, whose columns are sequentially observed count vectors. The model builds a novel Markov chain that sends the latent gamma random variables at time $(t-1)$ as the shape parameters of those at time $t$, which are linked to observed or latent counts under the Poisson likelihood. The significant challenge of inferring the gamma shape parameters is fully addressed, using unique data augmentation and marginalization techniques for the negative binomial distribution. The same nonparametric Bayesian model also applies to the factorization of a dynamic binary matrix, via a Bernoulli-Poisson link that connects a binary observation to a latent count, with closed-form conditional posteriors for the latent counts and efficient computation for sparse observations. We apply the model to text and music analysis, with state-of-the-art results.
Submitted 30 December, 2015;
originally announced December 2015.
-
Statistically efficient tomography of low rank states with incomplete measurements
Authors:
Anirudh Acharya,
Theodore Kypraios,
Madalin Guta
Abstract:
The construction of physically relevant low dimensional state models, and the design of appropriate measurements are key issues in tackling quantum state tomography for large dimensional systems. We consider the statistical problem of estimating low rank states in the set-up of multiple ions tomography, and investigate how the estimation error behaves with a reduction in the number of measurement settings, compared with the standard ion tomography setup. We present extensive simulation results showing that the error is robust with respect to the choice of states of a given rank, the random selection of settings, and that the number of settings can be significantly reduced with only a negligible increase in error. We present an argument to explain these findings based on a concentration inequality for the Fisher information matrix. In the more general setup of random basis measurements we use this argument to show that for certain rank $r$ states it suffices to measure in $O(r\log d)$ bases to achieve the average Fisher information over all bases. We present numerical evidence for states up to 8 atoms, supporting a conjecture on a lower bound for the Fisher information which, if true, would imply a similar behaviour in the case of Pauli bases. The relation to similar problems in compressed sensing is also discussed.
Submitted 23 October, 2015; v1 submitted 12 October, 2015;
originally announced October 2015.
-
A Complete Review of Controlling the FDR in a Multiple Comparison Problem Framework -- The Benjamini-Hochberg Algorithm
Authors:
Anish Acharya
Abstract:
This paper is a review of the popular Benjamini-Hochberg method and other related useful methods of multiple hypothesis testing. It is written with the purpose of serving as a short but complete, easy-to-understand review of the main article with proper background. The paper titled 'Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing' by Benjamini et al. [1] proposes a new framework for controlling the False Discovery Rate in a multiple hypothesis testing problem. It has been claimed that the procedure proposed in the paper results in a substantial gain in power and is more applicable to problems which call for False Discovery Rate (FDR) control rather than Familywise Error Rate (FWER) control. The proposed method uses a simple Bonferroni-type procedure for FDR control.
Submitted 27 June, 2014;
originally announced June 2014.
-
Probabilistic Combination of Classifier and Cluster Ensembles for Non-transductive Learning
Authors:
Ayan Acharya,
Eduardo R. Hruschka,
Joydeep Ghosh,
Badrul Sarwar,
Jean-David Ruvini
Abstract:
Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place. This paper describes a Bayesian framework that takes as input class labels from existing classifiers (designed based on labeled data from the source domain), as well as cluster labels from a cluster ensemble operating solely on the target data to be classified, and yields a consensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data. We also show that the proposed framework is privacy-aware and allows performing distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those provided by applying classifier ensembles only.
Submitted 10 November, 2012;
originally announced November 2012.
-
A Privacy-Aware Bayesian Approach for Combining Classifier and Cluster Ensembles
Authors:
Ayan Acharya,
Eduardo R. Hruschka,
Joydeep Ghosh
Abstract:
This paper introduces a privacy-aware Bayesian approach that combines ensembles of classifiers and clusterers to perform semi-supervised and transductive learning. We consider scenarios where instances and their classification/clustering results are distributed across different data sites and have sharing restrictions. As a special case, the privacy-aware computation of the model when instances of the target data are distributed across different data sites is also discussed. Experimental results show that the proposed approach can provide good classification accuracies while adhering to the data/model sharing constraints.
Submitted 19 April, 2012;
originally announced April 2012.