-
Memory-Efficient Differentially Private Training with Gradient Random Projection
Authors:
Alex Mulrooney,
Devansh Gupta,
James Flemings,
Huanyu Zhang,
Murali Annavaram,
Meisam Razaviyayn,
Xinwei Zhang
Abstract:
Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. Rather than directly applying DP to GaLore, DP-GRAPE introduces three key modifications: (1) gradients are privatized after projection, (2) random Gaussian matrices replace SVD-based subspaces, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility trade-off comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters.
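A minimal NumPy sketch of the "privatize after projection" idea described in the DP-GRAPE abstract. It is an illustration under simplifying assumptions, not the authors' implementation: gradients are treated as one flat vector per sample (the abstract says projection is applied during backpropagation), and the function name, the scaling of the projection matrix, and all parameter names are hypothetical.

```python
import numpy as np

def dp_projected_gradient(per_sample_grads, proj_dim, clip_norm, noise_multiplier, rng):
    """Return a privatized gradient estimate of shape (d,).

    per_sample_grads: array of shape (batch, d), one flattened gradient per sample.
    """
    batch, d = per_sample_grads.shape
    # Random Gaussian projection (assumed scaling: E[P.T @ P] = identity).
    P = rng.standard_normal((proj_dim, d)) / np.sqrt(proj_dim)
    projected = per_sample_grads @ P.T                      # (batch, proj_dim)
    # Clip each projected per-sample gradient to norm <= clip_norm.
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    clipped = projected * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Add Gaussian noise in the low-dimensional space, then average.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=proj_dim)
    noisy_mean = (clipped.sum(axis=0) + noise) / batch
    # Map the privatized low-dimensional gradient back to parameter space.
    return noisy_mean @ P

rng = np.random.default_rng(0)
grads = rng.standard_normal((8, 50))
update = dp_projected_gradient(grads, proj_dim=10, clip_norm=1.0,
                               noise_multiplier=1.0, rng=rng)
```

Only the (batch, proj_dim) projected gradients are clipped and stored in this sketch, which is where the memory saving relative to full per-sample gradients would come from.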
Submitted 18 June, 2025;
originally announced June 2025.
-
MARché: Fast Masked Autoregressive Image Generation with Cache-Aware Attention
Authors:
Chaoyi Jiang,
Sungwoo Kim,
Lei Gao,
Hossein Entezari Zarch,
Won Woo Ro,
Murali Annavaram
Abstract:
Masked autoregressive (MAR) models unify the strengths of masked and autoregressive generation by predicting tokens in a fixed order using bidirectional attention for image generation. While effective, MAR models suffer from significant computational overhead, as they recompute attention and feed-forward representations for all tokens at every decoding step, despite most tokens remaining semantically stable across steps. We propose a training-free generation framework MARché to address this inefficiency through two key components: cache-aware attention and selective KV refresh. Cache-aware attention partitions tokens into active and cached sets, enabling separate computation paths that allow efficient reuse of previously computed key/value projections without compromising full-context modeling. But a cached token cannot be used indefinitely without recomputation due to the changing contextual information over multiple steps. MARché recognizes this challenge and applies a technique called selective KV refresh. Selective KV refresh identifies contextually relevant tokens based on attention scores from newly generated tokens and updates only those tokens that require recomputation, while preserving image generation quality. MARché significantly reduces redundant computation in MAR without modifying the underlying architecture. Empirically, MARché achieves up to 1.7x speedup with negligible impact on image quality, offering a scalable and broadly applicable solution for efficient masked transformer generation.
Submitted 22 May, 2025;
originally announced June 2025.
-
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Authors:
Hossein Entezari Zarch,
Lei Gao,
Chaoyi Jiang,
Murali Annavaram
Abstract:
Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they depend on the current sequence context. We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the token acceptance rate that would result from drafting at each layer of an LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times \sim 2.50\times$ over vanilla auto-regressive decoding and improves upon the state-of-the-art SD methods by up to $0.27\times$.
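The selection problem described in the DEL abstract can be sketched as a small search over (exit layer, speculation length) pairs. The sketch below is not DEL's heuristic: it assumes independent token acceptance with rate `alpha`, uses the standard expected-tokens-per-round expression (1 - alpha^(k+1)) / (1 - alpha), and uses a crude cost model in which drafting one token at layer `l` costs `l / num_layers` of a full forward pass and verification costs one full pass. All names are hypothetical.

```python
def expected_tokens(alpha, k):
    """Expected tokens produced per round when drafting k tokens,
    assuming each draft token is accepted independently with rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def pick_exit_and_length(accept_rates, num_layers, max_len):
    """accept_rates maps exit layer -> tracked acceptance rate.
    Returns (tokens per unit cost, exit layer, speculation length)."""
    best = None
    for layer, alpha in sorted(accept_rates.items()):
        draft_cost = layer / num_layers          # assumed cost of one draft token
        for k in range(1, max_len + 1):
            round_cost = k * draft_cost + 1.0    # k drafts plus one verification pass
            speed = expected_tokens(alpha, k) / round_cost
            if best is None or speed > best[0]:
                best = (speed, layer, k)
    return best

speed, layer, k = pick_exit_and_length({8: 0.55, 16: 0.80, 24: 0.92},
                                       num_layers=32, max_len=8)
```

Re-running the search as the tracked acceptance rates change is one way to make the choice context-dependent rather than static.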
Submitted 7 April, 2025;
originally announced April 2025.
-
Differentially Private In-context Learning via Sampling Few-shot Mixed with Zero-shot Outputs
Authors:
James Flemings,
Haosheng Gan,
Hongyi Li,
Meisam Razaviyayn,
Murali Annavaram
Abstract:
In-context learning (ICL) has shown promising improvement in downstream task adaptation of LLMs by augmenting prompts with relevant input-output examples (demonstrations). However, the ICL demonstrations can contain privacy-sensitive information, which can be leaked and/or regurgitated by the LLM output. Differential Privacy (DP), a widely adopted privacy safeguard, has emerged to mitigate this privacy leakage, with recent work demonstrating strong privacy-utility tradeoffs in classification tasks for ICL. However, generation tasks for ICL are challenging due to the high-dimensional output space of open-ended generation. To this end, we propose $\texttt{dps-mozo}$, Differentially Private Sampling by Mixing One-shot with Zero-shot Outputs, a decoding framework that generates DP text by sampling from the product of multiple one-shot outputs mixed with a zero-shot output. This mixing effectively reduces the amount of information that can be leaked by each demonstration. By utilizing the inherent randomness in sampling from the mixed distributions, we can achieve DP without adding noise, thereby improving the privacy-utility tradeoff. Our experimental evaluations show $\texttt{dps-mozo}$ can achieve a strong privacy guarantee, $ε=2$, with minimal utility degradation compared to non-private few-shot learning: a 0.3% ROUGE-L F1 score decrease on the SAMSum dataset with Gemma 2 2B.
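The decoding step described in the abstract (mix each one-shot next-token distribution with the zero-shot distribution, take the product, sample) can be illustrated as follows. This is a sketch under assumptions, not the $\texttt{dps-mozo}$ algorithm: a single fixed mixing weight `lam` is used for every demonstration, whereas the privacy guarantee in the paper depends on how the mixing is calibrated, which the abstract does not specify. All names are hypothetical.

```python
import math
import random

def mix(one_shot, zero_shot, lam):
    """Convex mixture of a one-shot and the zero-shot distribution."""
    return [lam * p + (1.0 - lam) * q for p, q in zip(one_shot, zero_shot)]

def product_of_mixtures(one_shots, zero_shot, lam):
    """Normalized product of the mixed distributions.
    Assumes every mixed probability is strictly positive."""
    log_scores = [0.0] * len(zero_shot)
    for dist in one_shots:
        for token, p in enumerate(mix(dist, zero_shot, lam)):
            log_scores[token] += math.log(p)
    top = max(log_scores)                       # subtract max for numerical stability
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

def sample_token(dist, rng):
    """Draw one token index; the sampling randomness is the only noise source."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

zero_shot = [0.25, 0.25, 0.25, 0.25]
one_shots = [[0.70, 0.10, 0.10, 0.10], [0.60, 0.20, 0.10, 0.10]]
dist = product_of_mixtures(one_shots, zero_shot, lam=0.5)
token = sample_token(dist, random.Random(0))
```

A smaller `lam` pulls every mixed distribution toward the zero-shot output, which limits how much any single demonstration can change the sampled token.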
Submitted 31 January, 2025;
originally announced January 2025.
-
Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training
Authors:
Jonghyun Lee,
Yongqin Wang,
Rachit Rajat,
Murali Annavaram
Abstract:
Confidential computing (CC) or trusted execution enclaves (TEEs) is now the most common approach to enable secure computing in the cloud. The recent introduction of GPU TEEs by NVIDIA enables machine learning (ML) models to be trained without leaking model weights or data to the cloud provider. However, the potential performance implications of using GPU TEEs for ML training are not well characterized. In this work, we present an in-depth characterization study on performance overhead associated with running distributed data parallel (DDP) ML training with GPU Trusted Execution Environments (TEE).
Our study reveals the performance challenges in DDP training within GPU TEEs. DDP uses ring-all-reduce, a well-known approach, to aggregate gradients from multiple devices. Ring-all-reduce consists of multiple scatter-reduce and all-gather operations. In GPU TEEs, only the GPU package (GPU and HBM memory) is trusted. Hence, any data communicated outside the GPU packages must be encrypted and authenticated for confidentiality and integrity verification. Consequently, each phase of the ring-all-reduce requires encryption and message authentication code (MAC) generation by the sender, and decryption and MAC verification by the receiver. As the number of GPUs participating in DDP increases, the overhead of secure inter-GPU communication during ring-all-reduce grows proportionally. Additionally, larger models lead to more asynchronous all-reduce operations, exacerbating the communication cost. Our results show that with four GPU TEEs, depending on the model that is being trained, the runtime per training iteration increases by an average of 8x and up to a maximum of 41.6x compared to DDP training without TEE.
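To make the communication pattern concrete, here is a small pure-Python simulation of ring-all-reduce that counts inter-GPU messages. It is illustrative only: each "chunk" is a single number, and the message counter stands in for the encrypt-plus-MAC work that, per the abstract, every transfer leaving a GPU package would need inside a TEE. Function and variable names are hypothetical.

```python
def ring_all_reduce(data):
    """data[i][c] is worker i's value for chunk c (len(data) workers, len(data) chunks).
    Returns (per-worker results, number of point-to-point messages)."""
    n = len(data)
    chunks = [list(row) for row in data]
    messages = 0
    # Scatter-reduce: after n-1 steps worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        sent = [((i - step) % n, chunks[i][(i - step) % n]) for i in range(n)]
        for i, (c, value) in enumerate(sent):
            chunks[(i + 1) % n][c] += value
            messages += 1            # one encrypt + MAC / decrypt + verify in a TEE
    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sent = [((i + 1 - step) % n, chunks[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (c, value) in enumerate(sent):
            chunks[(i + 1) % n][c] = value
            messages += 1
    return chunks, messages

grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
reduced, messages = ring_all_reduce(grads)
```

With n workers the simulation performs 2 * (n - 1) steps of n messages each, so the number of protected transfers per all-reduce grows with the number of GPUs.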
Submitted 27 March, 2025; v1 submitted 20 January, 2025;
originally announced January 2025.
-
KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
Authors:
Chaoyi Jiang,
Lei Gao,
Hossein Entezari Zarch,
Murali Annavaram
Abstract:
Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead for token generation. However, the memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure, but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. Fully overlapping PCIe communication latency gets challenging as the size of the KV cache grows and/or the GPU compute capabilities increase. In this paper, we introduce KVPR, an efficient I/O-aware LLM inference method where the CPU first transfers a partial set of activations, from which the GPU can start recomputing the KV cache values. While the GPU recomputes the partial KV cache, the remaining portion of the KV cache is transferred concurrently from the CPU. This approach overlaps GPU recomputation with KV cache transfer to minimize idle GPU time and maximize inference performance. KVPR is fully automated by integrating a profiler module that utilizes input characteristics and system hardware information, a scheduler module to optimize the distribution of computation and communication workloads, and a runtime module to efficiently execute the derived execution plan. Experimental results show that KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches. The code is available at https://github.com/chaoyij/KVPR.
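The overlap described in the KVPR abstract suggests a simple split-point calculation, sketched below. This is not the KVPR scheduler: it assumes a linear cost model, assumes the KV cache for a token is twice the size of its activation (one key and one value), and brute-forces the number of tokens whose KV entries are recomputed on the GPU. All names and the cost model are hypothetical.

```python
def best_split(num_tokens, act_bytes_per_token, pcie_bytes_per_s, recompute_s_per_token):
    """Choose how many tokens' KV entries to recompute on the GPU.

    Timeline assumed: transfer activations for the first `split` tokens, then
    recompute their KV on the GPU while the remaining tokens' KV cache is
    transferred concurrently over PCIe.
    """
    kv_bytes_per_token = 2 * act_bytes_per_token   # assumption: K and V per token
    best_tokens, best_time = 0, float("inf")
    for split in range(num_tokens + 1):
        t_activations = split * act_bytes_per_token / pcie_bytes_per_s
        t_recompute = split * recompute_s_per_token
        t_remaining_kv = (num_tokens - split) * kv_bytes_per_token / pcie_bytes_per_s
        total = t_activations + max(t_recompute, t_remaining_kv)
        if total < best_time:
            best_tokens, best_time = split, total
    return best_tokens, best_time

split, latency = best_split(num_tokens=1000, act_bytes_per_token=8192,
                            pcie_bytes_per_s=16e9, recompute_s_per_token=2e-7)
```

Under this model the best split balances GPU recomputation time against the transfer time of the remaining KV cache, so neither the GPU nor the PCIe link sits idle for long.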
Submitted 4 June, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
Fastrack: Fast IO for Secure ML using GPU TEEs
Authors:
Yongqin Wang,
Rachit Rajat,
Jonghyun Lee,
Tingting Tang,
Murali Annavaram
Abstract:
As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handling authentication and computation. However, CPU-to-GPU communication overheads significantly hinder performance, as data must be encrypted, authenticated, decrypted, and verified, increasing costs by 12.69 to 33.53 times. This results in GPU TEE inference becoming 54.12% to 903.9% slower and training 10% to 455% slower than non-TEE systems, undermining GPU TEE advantages in latency-sensitive applications.
This paper analyzes Nvidia H100 TEE protocols and identifies three key overheads: 1) redundant CPU re-encryption, 2) limited authentication parallelism, and 3) unnecessary operation serialization. We propose Fastrack, optimizing with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) overlapping decryption with PCI-e transmission. These optimizations cut communication costs and reduce inference/training runtime by up to 84.6%, with minimal overhead compared to non-TEE systems.
Submitted 19 October, 2024;
originally announced October 2024.
-
Estimating Privacy Leakage of Augmented Contextual Knowledge in Language Models
Authors:
James Flemings,
Bo Jiang,
Wanrong Zhang,
Zafar Takhirov,
Murali Annavaram
Abstract:
Language models (LMs) rely on their parametric knowledge augmented with relevant contextual knowledge for certain tasks, such as question answering. However, the contextual knowledge can contain private information that may be leaked when answering queries, and estimating this privacy leakage is not well understood. A straightforward approach of directly comparing an LM's output to the contexts can overestimate the privacy risk, since the LM's parametric knowledge might already contain the augmented contextual knowledge. To this end, we introduce *context influence*, a metric that builds on differential privacy, a widely-adopted privacy notion, to estimate the privacy leakage of contextual knowledge during decoding. Our approach effectively measures how each subset of the context influences an LM's response while separating the specific parametric knowledge of the LM. Using our context influence metric, we demonstrate that context privacy leakage occurs when contextual knowledge is out of distribution with respect to parametric knowledge. Moreover, we experimentally demonstrate how context influence properly attributes the privacy leakage to augmented contexts, and we evaluate how factors -- such as model size, context size, generation position, etc. -- affect context privacy leakage. The practical implications of our results will inform practitioners of the privacy risk associated with augmented contextual knowledge.
Submitted 30 May, 2025; v1 submitted 3 October, 2024;
originally announced October 2024.
-
Adaptively Private Next-Token Prediction of Large Language Models
Authors:
James Flemings,
Meisam Razaviyayn,
Murali Annavaram
Abstract:
As Large Language Models (LLMs) proliferate, developing privacy safeguards for these models is crucial. One popular safeguard involves training LLMs in a differentially private manner. However, such solutions are shown to be computationally expensive and detrimental to the utility of these models. Since LLMs are deployed on the cloud and thus only accessible via an API, a Machine Learning as a Service (MLaaS) provider can protect its downstream data by privatizing the predictions during the decoding process. However, the practicality of such solutions still largely lags behind DP training methods. One recent promising approach, Private Mixing of Ensemble Distributions (PMixED), avoids additive noise by sampling from the output distributions of private LLMs mixed with the output distribution of a public model. Yet, PMixED must satisfy a fixed privacy level for a given number of queries, which is difficult for an analyst to estimate before inference and, hence, does not scale. To this end, we relax the requirements to a more practical setting by introducing Adaptive PMixED (AdaPMixED), a private decoding framework based on PMixED that is adaptive to the private and public output distributions evaluated on a given input query. In this setting, we introduce a noisy screening mechanism that filters out queries with potentially expensive privacy loss, and a data-dependent analysis that exploits the divergence of the private and public output distributions in its privacy loss calculation. Our experimental evaluations demonstrate that our mechanism and analysis can reduce the privacy loss by 16x while preserving the utility over the original PMixED. Furthermore, performing 100K predictions with AdaPMixED still achieves strong utility and a reasonable data-dependent privacy loss of 5.25.
Submitted 2 October, 2024;
originally announced October 2024.
-
Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines
Authors:
Lei Gao,
Amir Ziashahabi,
Yue Niu,
Salman Avestimehr,
Murali Annavaram
Abstract:
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. However, directly applying ZO methods on edge devices is impractical due to the high computational cost of multiple model perturbations required to achieve accuracy improvements. Based on these observations, we propose a memory- and computation-efficient LLM fine-tuning method for edge devices. Our approach has three key innovations: (1) We introduce a parallelized randomized gradient estimation (P-RGE) technique that achieves high parallel efficiency by leveraging outer-loop and inner-loop parallelization. This enables multiple function queries and forward passes to be executed in parallel, reducing training time. (2) We integrate P-RGE with parameter-efficient fine-tuning methods (e.g. LoRA) to further reduce computational and memory overhead. (3) We implement a P-RGE LoRA-FA module that fully supports fine-tuning with ExecuTorch. Our approach requires no modifications to ExecuTorch's runtime code, as it can be implemented with server-side code changes only. Experiments demonstrate that P-RGE achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications.
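Zeroth-order optimization, as mentioned in the abstract, approximates gradients from forward passes alone. The sketch below shows a generic two-point randomized gradient estimator in pure Python; it is not P-RGE itself (no outer-loop or inner-loop parallelization, no LoRA), and the function names and the Gaussian choice of perturbation are assumptions.

```python
import random

def rge_direction(loss_fn, theta, z, eps):
    """Two forward passes along direction z give a directional slope estimate."""
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [slope * zi for zi in z]

def rge(loss_fn, theta, eps, num_queries, rng):
    """Average the estimate over num_queries random Gaussian directions.
    The queries are independent of each other, so they could be batched."""
    grad = [0.0] * len(theta)
    for _ in range(num_queries):
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        estimate = rge_direction(loss_fn, theta, z, eps)
        grad = [g + e / num_queries for g, e in zip(grad, estimate)]
    return grad

def quadratic(theta):
    return sum(t * t for t in theta)

grad_estimate = rge(quadratic, [1.0, -2.0, 0.5], eps=1e-3,
                    num_queries=16, rng=random.Random(0))
```

Because each query needs only forward passes, an inference-only runtime can evaluate it; the cost is that many queries may be needed for a low-variance estimate.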
Submitted 6 November, 2024; v1 submitted 23 September, 2024;
originally announced September 2024.
-
CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data
Authors:
Hossein Entezari Zarch,
Abdulla Alshabanah,
Chaoyi Jiang,
Murali Annavaram
Abstract:
Deep learning recommendation models (DLRMs) are at the heart of the current e-commerce industry. However, the amount of training data used to train these large models is growing exponentially, leading to substantial training hurdles. The training dataset contains two primary types of information: content-based information (features of users and items) and collaborative information (interactions between users and items). One approach to reduce the training dataset is to remove user-item interactions. But that significantly diminishes collaborative information, which is crucial for maintaining accuracy due to its inclusion of interaction histories. This loss profoundly impacts DLRM performance.
This paper makes an important observation that if one can capture the user-item interaction history to enrich the user and item embeddings, then the interaction history can be compressed without losing model accuracy. Thus, this work, Collaborative Aware Data Compression (CADC), takes a two-step approach to training dataset compression. In the first step, we use matrix factorization of the user-item interaction matrix to create a novel embedding representation for both the users and items. Once the user and item embeddings are enriched by the interaction history information, the approach then applies uniform random sampling of the training dataset to drastically reduce the training dataset size while minimizing model accuracy drop. The source code of CADC is available at https://anonymous.4open.science/r/DSS-RM-8C1D/README.md.
Submitted 23 July, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Differentially Private Next-Token Prediction of Large Language Models
Authors:
James Flemings,
Meisam Razaviyayn,
Murali Annavaram
Abstract:
Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. The most widely adopted technique to accomplish this is DP-SGD, which trains a model to guarantee Differential Privacy (DP). However, DP-SGD overestimates an adversary's capabilities in having white box access to the model and, as a result, causes longer training times and larger memory usage than SGD. On the other hand, commercial LLM deployments are predominantly cloud-based; hence, adversarial access to LLMs is black-box. Motivated by these observations, we present Private Mixing of Ensemble Distributions (PMixED): a private prediction protocol for next-token prediction that utilizes the inherent stochasticity of next-token sampling and a public model to achieve Differential Privacy. We formalize this by introducing RD-mollifiers, which project each model's output distribution from an ensemble of fine-tuned LLMs onto a set around a public LLM's output distribution; the projected distributions are then averaged, and the next token is sampled from the average. Unlike DP-SGD, which needs to consider the model architecture during training, PMixED is model agnostic, which makes PMixED a very appealing solution for current deployments. Our results show that PMixED achieves a stronger privacy guarantee than sample-level privacy and outperforms DP-SGD for privacy $ε= 8$ on large-scale datasets. Thus, PMixED offers a practical alternative to DP training methods for achieving strong generative utility without compromising privacy.
Submitted 26 April, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
Edge Private Graph Neural Networks with Singular Value Perturbation
Authors:
Tingting Tang,
Yue Niu,
Salman Avestimehr,
Murali Annavaram
Abstract:
Graph neural networks (GNNs) play a key role in learning representations from graph-structured data and are demonstrated to be useful in many applications. However, the GNN training pipeline has been shown to be vulnerable to node feature leakage and edge extraction attacks. This paper investigates a scenario where an attacker aims to recover private edge information from a trained GNN model. Previous studies have employed differential privacy (DP) to add noise directly to the adjacency matrix or a compact graph representation. The added perturbations cause the graph structure to be substantially morphed, reducing the model utility. We propose a new privacy-preserving GNN training algorithm, Eclipse, that maintains good model utility while providing strong privacy protection on edges. Eclipse is based on two key observations. First, adjacency matrices in graph structures exhibit low-rank behavior. Thus, Eclipse trains GNNs with a low-rank format of the graph via singular value decomposition (SVD), rather than the original graph. Using the low-rank format, Eclipse preserves the primary graph topology and removes the remaining residual edges. Eclipse adds noise to the low-rank singular values instead of the entire graph, thereby preserving the graph privacy while still maintaining enough of the graph structure to maintain model utility. We theoretically show that Eclipse provides a formal DP guarantee on edges. Experiments on benchmark graph datasets show that Eclipse achieves a significantly better privacy-utility tradeoff compared to existing privacy-preserving GNN training methods. In particular, under strong privacy constraints ($ε$ < 4), Eclipse shows significant gains in the model utility by up to 46%. We further demonstrate that Eclipse also has better resilience against common edge attacks (e.g., LPA), lowering the attack AUC by up to 5% compared to other state-of-the-art baselines.
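The low-rank-plus-perturbation idea in the Eclipse abstract can be sketched with NumPy as follows. This is an illustration, not the Eclipse algorithm: the Gaussian noise and its scale are placeholders (the abstract does not state the noise distribution or calibration needed for the DP guarantee), and the function name and parameters are hypothetical.

```python
import numpy as np

def noisy_low_rank_adjacency(adj, rank, noise_scale, rng):
    """Keep the top-`rank` singular components of the adjacency matrix and
    perturb only the retained singular values."""
    U, s, Vt = np.linalg.svd(adj, full_matrices=False)
    # Noise is added to `rank` numbers instead of to every entry of the graph.
    s_noisy = s[:rank] + rng.normal(0.0, noise_scale, size=rank)
    # Reassemble the perturbed low-rank approximation.
    return (U[:, :rank] * s_noisy) @ Vt[:rank, :]

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
adj_private = noisy_low_rank_adjacency(adj, rank=2, noise_scale=0.1, rng=rng)
```

The truncation discards the residual edges carried by small singular values, and the perturbation touches far fewer values than adding noise to the full adjacency matrix would.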
Submitted 16 March, 2024;
originally announced March 2024.
-
Ethos: Rectifying Language Models in Orthogonal Parameter Space
Authors:
Lei Gao,
Yue Niu,
Tingting Tang,
Salman Avestimehr,
Murali Annavaram
Abstract:
Language models (LMs) have greatly propelled the research on natural language processing. However, LMs also raise concerns regarding the generation of biased or toxic content and the potential disclosure of private information from the training dataset. In this work, we present a new efficient approach, Ethos, that rectifies LMs to mitigate toxicity and bias in outputs and avoid privacy leakage. Ethos is built on task arithmetic. However, unlike current task arithmetic algorithms, Ethos distinguishes general beneficial and undesired knowledge when reconstructing task vectors. Specifically, Ethos first obtains a set of principal components from the pre-trained models using singular value decomposition. Then, by projecting the task vector onto principal components, Ethos identifies the principal components that encode general or undesired knowledge. Ethos performs negating using the task vector with undesired knowledge only, thereby minimizing collateral damage on general model utility. We demonstrate the efficacy of our approach on three different tasks: debiasing, detoxification, and memorization unlearning. Evaluations show Ethos is more effective in removing undesired knowledge and maintaining the overall model performance compared to current task arithmetic methods.
Submitted 1 April, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Differentially Private Knowledge Distillation via Synthetic Text Generation
Authors:
James Flemings,
Murali Annavaram
Abstract:
Large language models (LLMs) are achieving state-of-the-art performance in many different downstream tasks. However, the increasing urgency of data privacy puts pressure on practitioners to train LLMs with Differential Privacy (DP) on private data. Concurrently, the exponential growth in parameter size of LLMs necessitates model compression before deployment of LLMs on resource-constrained devices or latency-sensitive applications. Differential privacy and model compression generally must trade off utility loss to achieve their objectives. Moreover, simultaneously applying both schemes can compound the utility degradation. To this end, we propose DistilDP: a novel differentially private knowledge distillation algorithm that exploits synthetic data generated by a differentially private teacher LLM. The knowledge of a teacher LLM is transferred onto the student in two ways: through the synthetic data itself (the hard labels) and through the output distribution of the teacher evaluated on the synthetic data (the soft labels). Furthermore, if the teacher and student share a similar architectural structure, we can further distill knowledge by aligning the hidden representations between both. Our experimental results demonstrate that DistilDP can substantially improve the utility over existing baselines, by at least $9.0$ PPL on the Big Patent dataset, with strong privacy parameters, $ε=2$. These promising results advance privacy-preserving compression of autoregressive LLMs. Our code can be accessed here: https://github.com/james-flemings/dp_compress.
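The two transfer paths described in this abstract (hard labels from the synthetic text, soft labels from the teacher's output distribution) resemble a standard knowledge distillation objective. The following is a minimal sketch of such a combined per-token loss, not the DistilDP implementation: the function names, the mixing weight `alpha`, and the use of a KL term for the soft labels are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_probs, hard_label, alpha=0.5):
    """Weighted sum of a hard-label and a soft-label term for one token.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    student_probs = softmax(student_logits)
    # Hard-label term: cross-entropy against the synthetic token itself.
    hard = -math.log(student_probs[hard_label])
    # Soft-label term: KL(teacher || student) over the vocabulary.
    soft = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )
    return alpha * hard + (1.0 - alpha) * soft


# When the student already matches the teacher, only the hard-label term remains.
loss_matched = distillation_loss([0.0, 0.0], [0.5, 0.5], hard_label=0)
# A teacher that disagrees with the student adds a positive soft-label penalty.
loss_mismatch = distillation_loss([0.0, 0.0], [0.9, 0.1], hard_label=0)
```

In this sketch the privacy guarantee would come entirely from how the teacher and its synthetic data were produced; the loss itself adds no noise.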
Submitted 4 June, 2024; v1 submitted 1 March, 2024;
originally announced March 2024.
-
CompactTag: Minimizing Computation Overheads in Actively-Secure MPC for Deep Neural Networks
Authors:
Yongqin Wang,
Pratik Sarkar,
Nishat Koti,
Arpita Patra,
Murali Annavaram
Abstract:
Secure Multiparty Computation (MPC) protocols enable secure evaluation of a circuit by several parties, even in the presence of an adversary who maliciously corrupts all but one of the parties. These MPC protocols are constructed using the well-known secret-sharing-based paradigm (SPDZ and SPDZ2k), where the protocols ensure security against a malicious adversary by computing Message Authentication Code (MAC) tags on the input shares and then evaluating the circuit with these input shares and tags. However, this tag computation adds a significant runtime overhead, particularly for machine learning (ML) applications with numerous linear computation layers such as convolutions and fully connected layers.
To alleviate the tag computation overhead, we introduce CompactTag, a lightweight algorithm for generating MAC tags specifically tailored for linear layers in ML. Linear layer operations in ML, including convolutions, can be transformed into Toeplitz matrix multiplications. For the multiplication of two matrices with dimensions T1 x T2 and T2 x T3 respectively, SPDZ2k required O(T1 x T2 x T3) local multiplications for the tag computation. In contrast, CompactTag only requires O(T1 x T2 + T1 x T3 + T2 x T3) local multiplications, resulting in a substantial performance boost for various ML models.
We empirically compared our protocol to the SPDZ2k protocol for various ML circuits, including ResNet Training-Inference, Transformer Training-Inference, and VGG16 Training-Inference. SPDZ2k dedicated around 30% of its online runtime for tag computation. CompactTag speeds up this tag computation bottleneck by up to 23x, resulting in up to 1.47x total online phase runtime speedups for various ML workloads.
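To make the asymptotic comparison in this abstract concrete, the sketch below evaluates the two stated operation counts for given matrix dimensions. It only plugs numbers into the O(T1 x T2 x T3) and O(T1 x T2 + T1 x T3 + T2 x T3) expressions with constant factors dropped; it does not implement either protocol, and the function names are hypothetical.

```python
def spdz2k_tag_mults(t1, t2, t3):
    """Leading-order local multiplications for tag computation: T1 * T2 * T3."""
    return t1 * t2 * t3


def compacttag_tag_mults(t1, t2, t3):
    """Leading-order local multiplications: T1*T2 + T1*T3 + T2*T3."""
    return t1 * t2 + t1 * t3 + t2 * t3


def tag_mult_ratio(t1, t2, t3):
    """Ratio of the two leading-order counts (constant factors ignored)."""
    return spdz2k_tag_mults(t1, t2, t3) / compacttag_tag_mults(t1, t2, t3)


# Example with square 64 x 64 matrices (dimensions chosen only for illustration).
cubic = spdz2k_tag_mults(64, 64, 64)        # 262144
quadratic = compacttag_tag_mults(64, 64, 64)  # 12288
ratio = tag_mult_ratio(64, 64, 64)
```

These ratios are term counts, not runtimes, so they should not be read as a prediction of the measured speedups reported in the abstract.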
Submitted 7 November, 2023;
originally announced November 2023.
-
Data Leakage via Access Patterns of Sparse Features in Deep Learning-based Recommendation Systems
Authors:
Hanieh Hashemi,
Wenjie Xiong,
Liu Ke,
Kiwan Maeng,
Murali Annavaram,
G. Edward Suh,
Hsien-Hsin S. Lee
Abstract:
Online personalized recommendation services are generally hosted in the cloud where users query the cloud-based model to receive recommended input such as merchandise of interest or news feed. State-of-the-art recommendation models rely on sparse and dense features to represent users' profile information and the items they interact with. Although sparse features account for 99% of the total model size, little attention has been paid to the potential information leakage through sparse features. These sparse features are employed to track users' behavior, e.g., their click history, object interactions, etc., potentially carrying each user's private information. Sparse features are represented as learned embedding vectors that are stored in large tables, and personalized recommendation is performed by using a specific user's sparse feature to index through the tables. Even with recently proposed methods that hide the computation happening in the cloud, an attacker in the cloud may still be able to track the access patterns to the embedding tables. This paper explores the private information that may be learned by tracking a recommendation model's sparse feature access patterns. We first characterize the types of attacks that can be carried out on sparse features in recommendation models in an untrusted cloud, followed by a demonstration of how each of these attacks leads to extracting users' private information or tracking users by their behavior over time.
Submitted 12 December, 2022;
originally announced December 2022.
-
MPC-Pipe: an Efficient Pipeline Scheme for Secure Multi-party Machine Learning Inference
Authors:
Yongqin Wang,
Rachit Rajat,
Murali Annavaram
Abstract:
Multi-party computing (MPC) has been gaining popularity as a secure computing model over the past few years. However, prior works have demonstrated that MPC protocols still pay substantial performance penalties compared to plaintext, particularly when applied to ML algorithms. The overhead is due to added computation and communication costs. Prior studies, as well as our own analysis, found that most MPC protocols today sequentially perform communication and computation. The participating parties must compute on their shares first and then perform data communication to allow the distribution of new secret shares before proceeding to the next computation step. In this work, we show that serialization is unnecessary, particularly in the context of ML computations (both in Convolutional neural networks and in Transformer-based models). We demonstrate that it is possible to carefully orchestrate the computation and communication steps to overlap.
We propose MPC-Pipe, an efficient MPC system for both training and inference of ML workloads, which pipelines computations and communications in an MPC protocol during the online phase. MPC-Pipe proposes three pipeline schemes to optimize the online phase of ML in the semi-honest majority adversary setting. We implement MPC-Pipe by augmenting a modified version of CrypTen, which separates online and offline phases. We evaluate the end-to-end system performance benefits of the online phase of MPC using deep neural networks (VGG16, ResNet50) and Transformers using different network settings. We show that MPC-Pipe can improve the throughput and latency of ML workloads.
Submitted 27 August, 2024; v1 submitted 27 September, 2022;
originally announced September 2022.
-
DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware
Authors:
Hanieh Hashemi,
Yongqin Wang,
Murali Annavaram
Abstract:
Privacy and security-related concerns are growing as machine learning reaches diverse application domains. The data holders want to train or infer with private data while exploiting accelerators, such as GPUs, that are hosted in the cloud. Cloud systems are vulnerable to attackers that compromise the privacy of data and integrity of computations. Tackling such a challenge requires unifying theoretical privacy algorithms with hardware security capabilities. This paper presents DarKnight, a framework for large DNN training while protecting input privacy and computation integrity. DarKnight relies on cooperative execution between trusted execution environments (TEE) and accelerators, where the TEE provides privacy and integrity verification, while accelerators perform the bulk of the linear algebraic computation to optimize the performance. In particular, DarKnight uses a customized data encoding strategy based on matrix masking to create input obfuscation within a TEE. The obfuscated data is then offloaded to GPUs for fast linear algebraic computation. DarKnight's data obfuscation strategy provides provable data privacy and computation integrity in the cloud servers. While prior works tackle inference privacy and cannot be utilized for training, DarKnight's encoding scheme is designed to support both training and inference.
Submitted 30 June, 2022;
originally announced July 2022.
-
High-Throughput Secure Multiparty Computation with an Honest Majority in Various Network Settings
Authors:
Christopher Harth-Kitzerow,
Ajith Suresh,
Yongqin Wang,
Hossein Yalame,
Georg Carle,
Murali Annavaram
Abstract:
In this work, we present novel protocols over rings for semi-honest secure three-party computation (3PC) and malicious four-party computation (4PC) with one corruption. While most existing works focus on improving total communication complexity, challenges such as network heterogeneity and computational complexity, which impact MPC performance in practice, remain underexplored. Our protocols address these issues by tolerating multiple arbitrarily weak network links between parties without any substantial decrease in performance. Additionally, they significantly reduce computational complexity by requiring up to half the number of basic instructions per gate compared to related work. These improvements lead to up to twice the throughput of state-of-the-art protocols in homogeneous network settings and up to eight times higher throughput in real-world heterogeneous settings. These advantages come at no additional cost: Our protocols maintain the best-known total communication complexity per multiplication, requiring 3 elements for 3PC and 5 elements for 4PC. We implemented our protocols alongside several state-of-the-art protocols (Replicated 3PC, ASTRA, Fantastic Four, Tetrad) in a novel open-source C++ framework optimized for high throughput. Five out of six implemented 3PC and 4PC protocols achieve more than one billion 32-bit multiplications or over 32 billion AND gates per second using our implementation in a 25 Gbit/s LAN environment. This represents the highest throughput achieved in 3PC and 4PC so far, outperforming existing frameworks like MP-SPDZ, ABY3, MPyC, and MOTION by two to three orders of magnitude.
Submitted 21 May, 2025; v1 submitted 8 June, 2022;
originally announced June 2022.
-
Attribute Inference Attack of Speech Emotion Recognition in Federated Learning Settings
Authors:
Tiantian Feng,
Hanieh Hashemi,
Rajat Hebbar,
Murali Annavaram,
Shrikanth S. Narayanan
Abstract:
Speech emotion recognition (SER) processes speech signals to detect and characterize expressed perceived emotions. Many SER application systems often acquire and transmit speech data collected at the client-side to remote cloud platforms for inference and decision making. However, speech data carry rich information not only about emotions conveyed in vocal expressions, but also other sensitive demographic traits such as gender, age and language background. Consequently, it is desirable for SER systems to have the ability to classify emotion constructs while preventing unintended/improper inferences of sensitive and demographic information. Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing their local data. This training approach appears secure and can improve privacy for SER. However, recent works have demonstrated that FL approaches are still vulnerable to various privacy attacks like reconstruction attacks and membership inference attacks. Although most of these have focused on computer vision applications, such information leakages exist in the SER systems trained using the FL technique. To assess the information leakage of SER systems trained using FL, we propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters, corresponding to the FedSGD and the FedAvg training algorithms, respectively. As a use case, we empirically evaluate our approach for predicting the client's gender information using three SER benchmark datasets: IEMOCAP, CREMA-D, and MSP-Improv. We show that the attribute inference attack is achievable for SER systems trained using FL. We further identify that most information leakage possibly comes from the first layer in the SER model.
Submitted 22 December, 2022; v1 submitted 26 December, 2021;
originally announced December 2021.
-
Adaptive Verifiable Coded Computing: Towards Fast, Secure and Private Distributed Machine Learning
Authors:
Tingting Tang,
Ramy E. Ali,
Hanieh Hashemi,
Tynan Gangwani,
Salman Avestimehr,
Murali Annavaram
Abstract:
Stragglers, Byzantine workers, and data privacy are the main bottlenecks in distributed cloud computing. Some prior works proposed coded computing strategies to jointly address all three challenges. They require either a large number of workers, a significant communication cost or a significant computational complexity to tolerate Byzantine workers. Much of the overhead in prior schemes comes from the fact that they tightly couple coding for all three problems into a single framework. In this paper, we propose the Adaptive Verifiable Coded Computing (AVCC) framework, which decouples the Byzantine node detection challenge from the straggler tolerance. AVCC leverages coded computing just for handling stragglers and privacy, and then uses an orthogonal approach that leverages verifiable computing to mitigate Byzantine workers. Furthermore, AVCC dynamically adapts its coding scheme to trade off straggler tolerance against Byzantine protection. We evaluate AVCC on a compute-intensive distributed logistic regression application. Our experiments show that AVCC achieves up to $4.2\times$ speedup and up to $5.1\%$ accuracy improvement over the state-of-the-art Lagrange coded computing approach (LCC). AVCC also speeds up the conventional uncoded implementation of distributed logistic regression by up to $7.6\times$, and improves the test accuracy by up to $12.1\%$.
Submitted 22 March, 2022; v1 submitted 27 July, 2021;
originally announced July 2021.
-
LAORAM: A Look Ahead ORAM Architecture for Training Large Embedding Tables
Authors:
Rachit Rajat,
Yongqin Wang,
Murali Annavaram
Abstract:
Data confidentiality is becoming a significant concern, especially in the cloud computing era. Memory access patterns have been demonstrated to leak critical information such as security keys and a program's spatial and temporal information. This information leak poses an even more significant privacy challenge in machine learning models with embedding tables. Embedding tables are routinely used to learn categorical features from training data. Even knowing the locations of the embedding table entries accessed, not the data within the embedding table, will compromise categorical input data to the model. Embedding entries are privacy-sensitive since they disclose valuable properties about the user. Oblivious RAM (ORAM), and its enhanced variants such as PathORAM have emerged as viable solutions to hide leakage from memory access streams.
In this work, we present LAORAM, an ORAM framework explicitly designed to protect user privacy during embedding table training. LAORAM exploits a unique property of training: the training samples that will be used in the future are known beforehand. LAORAM preprocesses the training samples to identify the memory blocks that will be accessed together in the near future. The system tries to assign these blocks to as few paths as possible within the PathORAM infrastructure.
LAORAM does this by combining multiple blocks that are accessed together into superblocks. To further increase performance, LAORAM uses a fat-tree structure for PathORAM, reducing the number of background evictions required, which improves the stash usage. We have evaluated LAORAM using the embedding table configurations of both a recommendation model (DLRM) and an NLP model (XLM-R). LAORAM performs 5 times faster than PathORAM on a recommendation dataset (Kaggle) and 5.4x faster on an NLP dataset (XNLI), while providing the same security guarantees as the original PathORAM.
Submitted 29 June, 2022; v1 submitted 16 July, 2021;
originally announced July 2021.
-
SpreadGNN: Serverless Multi-task Federated Learning for Graph Neural Networks
Authors:
Chaoyang He,
Emir Ceyani,
Keshav Balasubramanian,
Murali Annavaram,
Salman Avestimehr
Abstract:
Graph Neural Networks (GNNs) are the first choice methods for graph machine learning problems thanks to their ability to learn state-of-the-art level representations from graph-structured data. However, centralizing a massive amount of real-world graph data for GNN training is prohibitive due to user-side privacy concerns, regulation restrictions, and commercial competition. Federated Learning is the de-facto standard for collaborative training of machine learning models over many distributed edge devices without the need for centralization. Nevertheless, training graph neural networks in a federated setting is vaguely defined and brings statistical and systems challenges. This work proposes SpreadGNN, a novel multi-task federated training framework capable of operating in the presence of partial labels and absence of a central server for the first time in the literature. SpreadGNN extends federated multi-task learning to realistic serverless settings for GNNs, and utilizes a novel optimization algorithm with a convergence guarantee, Decentralized Periodic Averaging SGD (DPA-SGD), to solve decentralized multi-task learning problems. We empirically demonstrate the efficacy of our framework on a variety of non-I.I.D. distributed graph-level molecular property prediction datasets with partial labels. Our results show that SpreadGNN outperforms GNN models trained over a central server-dependent federated learning system, even in constrained topologies. The source code is publicly available at https://github.com/FedML-AI/SpreadGNN
Submitted 4 June, 2021;
originally announced June 2021.
-
Byzantine-Robust and Privacy-Preserving Framework for FedML
Authors:
Hanieh Hashemi,
Yongqin Wang,
Chuan Guo,
Murali Annavaram
Abstract:
Federated learning has emerged as a popular paradigm for collaboratively training a model from data distributed among a set of clients. This learning setting presents, among others, two unique challenges: how to protect privacy of the clients' data during training, and how to ensure integrity of the trained model. We propose a two-pronged solution that aims to address both challenges under a single framework. First, we propose to create secure enclaves using a trusted execution environment (TEE) within the server. Each client can then encrypt their gradients and send them to verifiable enclaves. The gradients are decrypted within the enclave without the fear of privacy breaches. However, robustness check computations in a TEE are computationally prohibitive. Hence, in the second step, we perform a novel gradient encoding that enables TEEs to encode the gradients and then offload Byzantine check computations to accelerators such as GPUs. Our proposed approach provides theoretical bounds on information leakage and offers a significant speed-up over the baseline in empirical evaluation.
Submitted 5 May, 2021;
originally announced May 2021.
-
Privacy and Integrity Preserving Training Using Trusted Hardware
Authors:
Hanieh Hashemi,
Yongqin Wang,
Murali Annavaram
Abstract:
Privacy and security-related concerns are growing as machine learning reaches diverse application domains. The data holders want to train with private data while exploiting accelerators, such as GPUs, that are hosted in the cloud. However, cloud systems are vulnerable to attackers that compromise the privacy of data and integrity of computations. This work presents DarKnight, a framework for large DNN training while protecting input privacy and computation integrity. DarKnight relies on cooperative execution between trusted execution environments (TEE) and accelerators, where the TEE provides privacy and integrity verification, while accelerators perform the computation-heavy linear algebraic operations.
Submitted 1 May, 2021;
originally announced May 2021.
-
FedGraphNN: A Federated Learning System and Benchmark for Graph Neural Networks
Authors:
Chaoyang He,
Keshav Balasubramanian,
Emir Ceyani,
Carl Yang,
Han Xie,
Lichao Sun,
Lifang He,
Liangwei Yang,
Philip S. Yu,
Yu Rong,
Peilin Zhao,
Junzhou Huang,
Murali Annavaram,
Salman Avestimehr
Abstract:
Graph Neural Network (GNN) research is rapidly growing thanks to the capacity of GNNs to learn distributed representations from graph-structured data. However, centralizing a massive amount of real-world graph data for GNN training is prohibitive due to privacy concerns, regulation restrictions, and commercial competition. Federated learning (FL), a trending distributed learning paradigm, provides possibilities to solve this challenge while preserving data privacy. Despite recent advances in vision and language domains, there is no suitable platform for the FL of GNNs. To this end, we introduce FedGraphNN, an open FL benchmark system that can facilitate research on federated GNNs. FedGraphNN is built on a unified formulation of graph FL and contains a wide range of datasets from different domains, popular GNN models, and FL algorithms, with secure and efficient system support. Particularly for the datasets, we collect, preprocess, and partition 36 datasets from 7 domains, including both publicly available ones and specifically obtained ones such as hERG and Tencent. Our empirical analysis showcases the utility of our benchmark system, while exposing significant challenges in graph FL: federated GNNs perform worse in most datasets with a non-IID split than centralized GNNs; the GNN model that attains the best result in the centralized setting may not maintain its advantage in the FL setting. These results imply that more research efforts are needed to unravel the mystery behind federated GNNs. Moreover, our system performance analysis demonstrates that the FedGraphNN system is computationally efficient and secure for large-scale graph datasets. We maintain the source code at https://github.com/FedML-AI/FedGraphNN.
Submitted 7 September, 2021; v1 submitted 14 April, 2021;
originally announced April 2021.
-
Distributed Training of Graph Convolutional Networks using Subgraph Approximation
Authors:
Alexandra Angerd,
Keshav Balasubramanian,
Murali Annavaram
Abstract:
Modern machine learning techniques are successfully being adapted to data modeled as graphs. However, many real-world graphs are typically very large and do not fit in memory, often making the problem of training machine learning models on them intractable. Distributed training has been successfully employed to alleviate memory problems and speed up training in machine learning domains in which the input data is assumed to be independent and identically distributed (i.i.d.). However, distributing the training of non-i.i.d. data such as graphs that are used as training inputs in Graph Convolutional Networks (GCNs) causes accuracy problems since information is lost at the graph partitioning boundaries.
In this paper, we propose a training strategy that mitigates the lost information across multiple partitions of a graph through a subgraph approximation scheme. Our proposed approach augments each sub-graph with a small amount of edge and vertex information that is approximated from all other sub-graphs. The subgraph approximation approach helps the distributed training system converge at single-machine accuracy, while keeping the memory footprint low and minimizing synchronization overhead between the machines.
Submitted 9 December, 2020;
originally announced December 2020.
-
Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models
Authors:
Assaf Eisenman,
Kiran Kumar Matam,
Steven Ingram,
Dheevatsa Mudigere,
Raghuraman Krishnamoorthi,
Krishnakumar Nair,
Misha Smelyanskiy,
Murali Annavaram
Abstract:
Checkpoints play an important role in training long-running machine learning (ML) models. Checkpoints take a snapshot of an ML model and store it in a non-volatile memory so that they can be used to recover from failures to ensure rapid training progress. In addition, they are used for online training to improve inference prediction accuracy with continuous learning. Given the large and ever-increasing model sizes, checkpoint frequency is often bottlenecked by the storage write bandwidth and capacity. When checkpoints are maintained on remote storage, as is the case with many industrial settings, they are also bottlenecked by network bandwidth. We present Check-N-Run, a scalable checkpointing system for training large ML models at Facebook. While Check-N-Run is applicable to long-running ML jobs, we focus on checkpointing recommendation models, which are currently the largest ML models, with model sizes in the terabytes. Check-N-Run uses two primary techniques to address the size and bandwidth challenges. First, it applies incremental checkpointing, which tracks and checkpoints the modified part of the model. Incremental checkpointing is particularly valuable in the context of recommendation models where only a fraction of the model (stored as embedding tables) is updated on each iteration. Second, Check-N-Run leverages quantization techniques to significantly reduce the checkpoint size, without degrading training accuracy. These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models at Facebook, and thereby significantly improve checkpoint capabilities while reducing the total cost of ownership.
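The incremental checkpointing idea in this abstract (track which parts of the model changed and write only those) can be illustrated with a toy embedding table. This is a minimal sketch under simplifying assumptions, not Check-N-Run's implementation: the class and function names are hypothetical, the table is a plain dictionary of rows, and quantization is omitted entirely.

```python
class IncrementalCheckpointer:
    """Toy tracker of embedding rows modified since the last checkpoint."""

    def __init__(self, table):
        self.table = table   # dict: row id -> list of floats
        self.dirty = set()   # row ids modified since the last checkpoint

    def update_row(self, row_id, new_values):
        """Apply a training update to one row and mark it as modified."""
        self.table[row_id] = list(new_values)
        self.dirty.add(row_id)

    def checkpoint(self):
        """Return only the modified rows and reset the tracking set."""
        delta = {row_id: list(self.table[row_id]) for row_id in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(baseline, deltas):
    """Rebuild a table from a full baseline plus deltas applied in order."""
    table = {row_id: list(values) for row_id, values in baseline.items()}
    for delta in deltas:
        for row_id, values in delta.items():
            table[row_id] = list(values)
    return table


# A three-row table where each iteration touches a single row.
baseline = {0: [0.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 0.0]}
ckpt = IncrementalCheckpointer({k: list(v) for k, v in baseline.items()})
ckpt.update_row(1, [0.5, -0.5])
delta_a = ckpt.checkpoint()   # contains row 1 only
ckpt.update_row(2, [1.0, 1.0])
delta_b = ckpt.checkpoint()   # contains row 2 only
recovered = restore(baseline, [delta_a, delta_b])
```

Each delta here holds one row instead of three, which is the source of the write-bandwidth saving; a production system would also need to bound how many deltas accumulate before taking a fresh full baseline.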
Submitted 4 May, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.
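The incremental checkpointing idea described in the Check-N-Run abstract above (track which parts of the model changed, then write only those) can be illustrated with a minimal sketch. This is not the Check-N-Run implementation: the class name `IncrementalCheckpointer`, the `restore` helper, and the dictionary-of-rows layout are all hypothetical, and quantization is omitted.

```python
# Hypothetical sketch of incremental checkpointing; not the Check-N-Run code.
class IncrementalCheckpointer:
    def __init__(self, table):
        # table maps a row id to an embedding row (list of floats)
        self.table = table
        self.dirty = set()  # ids of rows modified since the last checkpoint

    def update(self, row_id, new_row):
        # Apply a training update and remember that the row changed.
        self.table[row_id] = list(new_row)
        self.dirty.add(row_id)

    def checkpoint(self):
        # Snapshot only the modified rows, then reset the tracking set.
        delta = {r: list(self.table[r]) for r in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(baseline, deltas):
    # Rebuild the table by replaying deltas, oldest first, over a full baseline.
    table = {r: list(v) for r, v in baseline.items()}
    for delta in deltas:
        for r, v in delta.items():
            table[r] = list(v)
    return table


baseline = {i: [0.0, 0.0] for i in range(4)}
ckpt = IncrementalCheckpointer({r: list(v) for r, v in baseline.items()})
ckpt.update(1, [0.5, -0.5])
d1 = ckpt.checkpoint()   # contains row 1 only
ckpt.update(3, [1.0, 2.0])
d2 = ckpt.checkpoint()   # contains row 3 only
recovered = restore(baseline, [d1, d2])
```

In this toy run each delta holds one of the four rows, which is the source of the write-bandwidth saving when only a small fraction of the rows changes between checkpoints.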
-
Secure and Fault Tolerant Decentralized Learning
Authors:
Saurav Prakash,
Hanieh Hashemi,
Yongqin Wang,
Murali Annavaram,
Salman Avestimehr
Abstract:
Federated learning (FL) is a promising paradigm for training a global model over data distributed across multiple data owners without centralizing clients' raw data. However, sharing of local model updates can also reveal information about clients' local datasets. Trusted execution environments (TEEs) within the FL server have recently been deployed by companies like Meta for secure aggregation. However, secure aggregation can suffer from error-prone local updates sent by clients that become faulty during training due to underlying device malfunctions. Also, data heterogeneity across clients makes fault mitigation challenging, as even updates from normal clients are dissimilar. Thus, most prior fault-tolerant methods, which treat any local update differing from the majority of other updates as faulty, perform poorly. We propose DiverseFL to make model aggregation secure as well as robust to faults. In DiverseFL, any client whose local model update diverges from its associated guiding update is tagged as faulty. To implement our novel per-client criterion for fault mitigation, DiverseFL creates a TEE-based secure enclave within the FL server, which, in addition to performing secure aggregation for the global model update step, securely receives a small representative sample of local data from each client only once before training, and computes guiding updates for each participating client during training. Thus, DiverseFL provides security against privacy leakage as well as robustness against faulty clients. In experiments, DiverseFL consistently achieves significant improvements in absolute test accuracy over prior fault mitigation benchmarks. DiverseFL also performs close to OracleSGD, where the server combines updates only from the normal clients. We also analyze the convergence rate of DiverseFL under non-IID data and standard convexity assumptions.
Submitted 13 September, 2022; v1 submitted 15 October, 2020;
originally announced October 2020.
-
Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge
Authors:
Chaoyang He,
Murali Annavaram,
Salman Avestimehr
Abstract:
Scaling up the convolutional neural network (CNN) size (e.g., width, depth, etc.) is known to effectively improve model accuracy. However, the large model size impedes training on resource-constrained edge devices. For instance, federated learning (FL) may place undue burden on the compute capability of edge nodes, even though there is a strong practical need for FL due to its privacy and confidentiality properties. To address the resource-constrained reality of edge devices, we reformulate FL as a group knowledge transfer training algorithm, called FedGKT. FedGKT designs a variant of the alternating minimization approach to train small CNNs on edge nodes and periodically transfer their knowledge by knowledge distillation to a large server-side CNN. FedGKT consolidates several advantages into a single framework: reduced demand for edge computation, lower communication bandwidth for large CNNs, and asynchronous training, all while maintaining model accuracy comparable to FedAvg. We train CNNs designed based on ResNet-56 and ResNet-110 using three distinct datasets (CIFAR-10, CIFAR-100, and CINIC-10) and their non-I.I.D. variants. Our results show that FedGKT can obtain comparable or even slightly higher accuracy than FedAvg. More importantly, FedGKT makes edge training affordable. Compared to the edge training using FedAvg, FedGKT demands 9 to 17 times less computational power (FLOPs) on edge devices and requires 54 to 105 times fewer parameters in the edge CNN. Our source code is released at FedML (https://fedml.ai).
Submitted 5 November, 2020; v1 submitted 28 July, 2020;
originally announced July 2020.
-
FedML: A Research Library and Benchmark for Federated Machine Learning
Authors:
Chaoyang He,
Songze Li,
Jinhyun So,
Xiao Zeng,
Mi Zhang,
Hongyi Wang,
Xiaoyang Wang,
Praneeth Vepakomma,
Abhishek Singh,
Hang Qiu,
Xinghua Zhu,
Jianzong Wang,
Li Shen,
Peilin Zhao,
Yan Kang,
Yang Liu,
Ramesh Raskar,
Qiang Yang,
Murali Annavaram,
Salman Avestimehr
Abstract:
Federated learning (FL) is a rapidly growing research field in machine learning. However, existing FL libraries cannot adequately support diverse algorithmic development; inconsistent dataset and model usage make fair algorithm comparison challenging. In this work, we introduce FedML, an open research library and benchmark to facilitate FL algorithm development and fair performance comparison. FedML supports three computing paradigms: on-device training for edge devices, distributed computing, and single-machine simulation. FedML also promotes diverse algorithmic research with flexible and generic API design and comprehensive reference baseline implementations (optimizer, models, and datasets). We hope FedML could provide an efficient and reproducible means for developing and evaluating FL algorithms that would benefit the FL research community. We maintain the source code, documents, and user community at https://fedml.ai.
Submitted 8 November, 2020; v1 submitted 27 July, 2020;
originally announced July 2020.
-
DarKnight: A Data Privacy Scheme for Training and Inference of Deep Neural Networks
Authors:
Hanieh Hashemi,
Yongqin Wang,
Murali Annavaram
Abstract:
Protecting the privacy of input data is of growing importance as machine learning methods reach new application domains. In this paper, we provide a unified training and inference framework for large DNNs while protecting input privacy and computation integrity. Our approach, called DarKnight, uses a novel data blinding strategy based on matrix masking to create input obfuscation within a trusted execution environment (TEE). Our rigorous mathematical proof demonstrates that our blinding process provides an information-theoretic privacy guarantee by bounding information leakage. The obfuscated data can then be offloaded to any GPU to accelerate linear operations on blinded data. The results from linear operations on blinded data are decoded before performing non-linear operations within the TEE. This cooperative execution allows DarKnight to exploit the computational power of GPUs to perform linear operations while exploiting TEEs to protect input privacy. We implement DarKnight on an Intel SGX TEE augmented with a GPU to evaluate its performance.
Submitted 15 October, 2020; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Towards Non-I.I.D. and Invisible Data with FedNAS: Federated Deep Learning via Neural Architecture Search
Authors:
Chaoyang He,
Murali Annavaram,
Salman Avestimehr
Abstract:
Federated Learning (FL) has proven to be an effective learning framework when data cannot be centralized due to privacy, communication costs, and regulatory restrictions. When training deep learning models under an FL setting, people employ a predefined model architecture discovered in the centralized environment. However, this predefined architecture may not be the optimal choice because it may not fit data that are not independent and identically distributed (non-IID). Thus, we advocate automating federated learning (AutoFL) to improve model accuracy and reduce the manual design effort. We specifically study AutoFL via Neural Architecture Search (NAS), which can automate the design process. We propose a Federated NAS (FedNAS) algorithm to help scattered workers collaboratively search for a better architecture with higher accuracy. We also build a system based on FedNAS. Our experiments on a non-IID dataset show that the architecture searched by FedNAS can outperform the manually predefined architecture.
Submitted 3 January, 2021; v1 submitted 18 April, 2020;
originally announced April 2020.
-
Jupiter: A Networked Computing Architecture
Authors:
Pradipta Ghosh,
Quynh Nguyen,
Pranav K Sakulkar,
Aleksandra Knezevic,
Jason A. Tran,
Jiatong Wang,
Zhifeng Lin,
Bhaskar Krishnamachari,
Murali Annavaram,
Salman Avestimehr
Abstract:
In the era of the Internet of Things, there is an increasing demand for networked computing to support the requirements of time-constrained, compute-intensive distributed applications such as multi-camera video processing and data fusion for security. We present Jupiter, an open source networked computing system that takes as input a Directed Acyclic Graph (DAG)-based computational task graph, efficiently distributes the tasks among a set of networked compute nodes regardless of their geographical separation, and orchestrates the execution of the DAG thereafter. This Kubernetes container-orchestration-based system supports both centralized and decentralized scheduling algorithms for optimally mapping the tasks based on information from a range of profilers: network profilers, resource profilers, and execution time profilers. While centralized scheduling algorithms with global knowledge have been popular in the grid/cloud computing community, we argue that a distributed scheduling approach is better suited for networked computing due to lower communication and computation overhead in the face of network dynamics. To this end, we propose and implement a new class of distributed scheduling algorithms called WAVE on the Jupiter system. We present a set of real-world experiments on two separate testbeds, one a world-wide network of 90 cloud computers across 8 cities and the other a cluster of 30 Raspberry Pi nodes, over a simple networked computing application called Distributed Network Anomaly Detector (DNAD). We show that despite using more localized knowledge, a distributed WAVE greedy algorithm can achieve performance similar to that of a classical centralized scheduling algorithm called Heterogeneous Earliest Finish Time (HEFT), suitably enhanced for the Jupiter system.
Submitted 23 December, 2019;
originally announced December 2019.
-
Privacy-Preserving Inference in Machine Learning Services Using Trusted Execution Environments
Authors:
Krishna Giri Narra,
Zhifeng Lin,
Yongqin Wang,
Keshav Balasubramaniam,
Murali Annavaram
Abstract:
This work presents Origami, which provides privacy-preserving inference for large deep neural network (DNN) models through a combination of enclave execution and cryptographic blinding, interspersed with accelerator-based computation. Origami partitions the ML model into multiple partitions. The first partition receives the encrypted user input within an SGX enclave. The enclave decrypts the input and then applies cryptographic blinding to the input data and the model parameters. Cryptographic blinding is a technique that adds noise to obfuscate data. Origami sends the obfuscated data for computation to an untrusted GPU/CPU. The blinding and unblinding factors are kept private by the SGX enclave, thereby preventing any adversary from denoising the data when the computation is offloaded to a GPU/CPU. The computed output is returned to the enclave, which decodes the computation on noisy data using the unblinding factors privately stored within SGX. This process may be repeated for each DNN layer, as has been done in the prior work Slalom.
However, the overhead of blinding and unblinding the data is a limiting factor for scalability. Origami relies on the empirical observation that the feature maps after the first several layers cannot be used, even by a powerful conditional GAN adversary, to reconstruct the input. Hence, Origami dynamically switches to executing the rest of the DNN layers directly on an accelerator without needing any further cryptographic blinding intervention to preserve privacy. We empirically demonstrate that, using Origami, a conditional GAN adversary, even with an unlimited inference budget, cannot reconstruct the input. We implement and demonstrate the performance gains of Origami using the VGG-16 and VGG-19 models. Relative to running the entire VGG-19 model within SGX, Origami raises the speedup of private inference from the 11x achieved with Slalom to 15.1x.
Submitted 7 December, 2019;
originally announced December 2019.
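The blind, offload, and unblind flow described in the Origami abstract above can be sketched for a single linear layer. The sketch assumes additive blinding over integers modulo a prime, which is one common way to blind linear operations; it blinds only the input (not the model parameters), and the modulus `P` and all function names are illustrative assumptions rather than Origami's actual code.

```python
import random

# Assumption: arithmetic modulo a prime, with already-quantized integer values.
P = 2**31 - 1


def matvec(W, x):
    # Linear layer y = W x (mod P); this is the part offloaded to the accelerator.
    return [sum(w * xi for w, xi in zip(row, x)) % P for row in W]


def blind(x, r):
    # Inside the enclave: hide x by adding the secret random vector r.
    return [(xi + ri) % P for xi, ri in zip(x, r)]


def unblind(y_blinded, Wr):
    # Inside the enclave: subtract W r to recover W x from W (x + r).
    return [(yb - u) % P for yb, u in zip(y_blinded, Wr)]


rng = random.Random(0)
W = [[3, 1, 4], [1, 5, 9]]
x = [2, 7, 1]
r = [rng.randrange(P) for _ in x]    # blinding factors, never leave the enclave
Wr = matvec(W, r)                    # unblinding factors, computed in the enclave
y_blinded = matvec(W, blind(x, r))   # computed on the untrusted device
y = unblind(y_blinded, Wr)           # equals W x (mod P)
```

The scheme works because the layer is linear: W(x + r) - Wr = Wx. Non-linear operations do not have this property, which is why such designs decode inside the enclave before applying them.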
-
Train Where the Data is: A Case for Bandwidth Efficient Coded Training
Authors:
Zhifeng Lin,
Krishna Giri Narra,
Mingchao Yu,
Salman Avestimehr,
Murali Annavaram
Abstract:
Training a machine learning model is both compute- and data-intensive. Most model training is performed on high-performance compute nodes, and the training data is stored near these nodes for faster training. But there is a growing interest in enabling training near the data. For instance, mobile devices are rich sources of training data. It may not be feasible to consolidate the data from mobile devices into a cloud service, due to bandwidth and data privacy reasons. Training on mobile devices is, however, fraught with challenges. First, mobile devices may join or leave the distributed setting, either voluntarily or due to environmental uncertainties, such as lack of power. Tolerating uncertainties is critical to the success of distributed mobile training. One proactive approach to tolerating computational uncertainty is to store data in a coded format and perform training on coded data. Encoding data is a challenging task since erasure codes require multiple devices to exchange their data to create a coded data partition, which places a significant burden on bandwidth. Furthermore, coded computing has traditionally relied on a central node to encode and distribute data to all the worker nodes, which is not practical in a distributed mobile setting.
In this paper, we tackle the uncertainty in distributed mobile training using a bandwidth-efficient encoding strategy. We use Random Linear Network Coding (RLNC), which reduces the need to exchange data partitions across all participating mobile devices while preserving the ability of coded computing to tolerate uncertainties. We implement gradient descent for logistic regression and SVM to evaluate the effectiveness of our mobile training framework. We demonstrate a 50% reduction in total required communication bandwidth compared to MDS coded computation, which uses one of the popular erasure codes.
Submitted 22 October, 2019;
originally announced October 2019.
-
Collage Inference: Achieving low tail latency during distributed image classification using coded redundancy models
Authors:
Krishna Narra,
Zhifeng Lin,
Ganesh Ananthanarayanan,
Salman Avestimehr,
Murali Annavaram
Abstract:
Reducing the latency variance in machine learning inference is a key requirement in many applications. Variance is harder to control in a cloud deployment in the presence of stragglers. In spite of this challenge, inference is increasingly being done in the cloud, due to the advent of affordable machine learning as a service (MLaaS) platforms. Existing approaches to reduce variance rely on replication which is expensive and partially negates the affordability of MLaaS. In this work, we argue that MLaaS platforms also provide unique opportunities to cut the cost of redundancy. In MLaaS platforms, multiple inference requests are concurrently received by a load balancer which can then create a more cost-efficient redundancy coding across a larger collection of images. We propose a novel convolutional neural network model, Collage-CNN, to provide a low-cost redundancy framework. A Collage-CNN model takes a collage formed by combining multiple images and performs multi-image classification in one shot, albeit at slightly lower accuracy. We then augment a collection of traditional single image classifiers with a single Collage-CNN classifier which acts as a low-cost redundant backup. Collage-CNN then provides backup classification results if a single image classification straggles. Deploying the Collage-CNN models in the cloud, we demonstrate that the 99th percentile tail latency of inference can be reduced by 1.47X compared to replication based approaches while providing high accuracy. Also, variation in inference latency can be reduced by 9X with a slight increase in average inference latency.
Submitted 5 June, 2019;
originally announced June 2019.
-
PartitionedVC: Partitioned External Memory Graph Analytics Framework for SSDs
Authors:
Kiran Kumar Matam,
Hanieh Hashemi,
Murali Annavaram
Abstract:
Graph analytics are at the heart of a broad range of applications such as drug discovery, page ranking, and recommendation systems. When graph size exceeds memory size, out-of-core graph processing is needed. For the widely used external memory graph processing systems, accessing storage becomes the bottleneck. We make the observation that nearly all graph algorithms have a dynamically varying number of active vertices that must be processed in each iteration. However, existing graph processing frameworks, such as GraphChi, load the entire graph in each iteration even if only a small fraction of the graph is active. This limitation is due to the structure of the data storage used by these systems. In this work, we propose to use a compressed sparse row (CSR) based graph storage that is more amenable to selectively loading only a few active vertices in each iteration. But CSR-based systems suffer from random update propagation to many target vertices. To solve this challenge, we propose a multi-log update mechanism that logs updates separately, rather than directly updating the active edges in a graph. Our proposed multi-log system maintains a separate log for each vertex interval. This separation enables us to efficiently process each vertex interval by just loading the corresponding log. Further, when accessing SSD pages that contain little active vertex data, we reduce the read amplification caused by page-granular SSD accesses by logging the active vertex data in the current iteration and efficiently reading the log in the next iteration. Over the current state-of-the-art out-of-core graph processing framework, our PartitionedVC improves performance by up to $17.84\times$, $1.19\times$, $1.65\times$, $1.38\times$, $3.15\times$, and $6.00\times$ for the widely used bfs, pagerank, community detection, graph coloring, maximal independent set, and random-walk applications, respectively.
Submitted 11 February, 2020; v1 submitted 10 May, 2019;
originally announced May 2019.
-
Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification
Authors:
Krishna Giri Narra,
Zhifeng Lin,
Ganesh Ananthanarayanan,
Salman Avestimehr,
Murali Annavaram
Abstract:
MLaaS (ML-as-a-Service) offerings by cloud computing platforms are becoming increasingly popular. Hosting pre-trained machine learning models in the cloud enables elastic scalability as the demand grows. But providing low latency and reducing the latency variance are key requirements. Variance is harder to control in a cloud deployment due to uncertainties in resource allocations across many virtual instances. We propose the collage inference technique, which uses a novel convolutional neural network model, collage-cnn, to provide low-cost redundancy. A collage-cnn model takes a collage image formed by combining multiple images and performs multi-image classification in one shot, albeit at slightly lower accuracy. We augment a collection of traditional single-image classifier models with a single collage-cnn classifier, which acts as their low-cost redundant backup. Collage-cnn provides backup classification results if any single-image classification request experiences a slowdown. Deploying the collage-cnn models in the cloud, we demonstrate that the 99th percentile tail latency of inference can be reduced by 1.2x to 2x compared to replication-based approaches while providing high accuracy. Variation in inference latency can be reduced by 1.8x to 15x.
Submitted 10 September, 2019; v1 submitted 27 April, 2019;
originally announced April 2019.
-
Slack Squeeze Coded Computing for Adaptive Straggler Mitigation
Authors:
Krishna Giri Narra,
Zhifeng Lin,
Mehrdad Kiamari,
Salman Avestimehr,
Murali Annavaram
Abstract:
While performing distributed computations in today's cloud-based platforms, execution speed variations among compute nodes can significantly reduce the performance and create bottlenecks like stragglers. Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations. In this paper, we propose a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation ($S^2C^2$). $S^2C^2$ squeezes the compute slack (i.e., overhead) that is built into the coded computing frameworks by efficiently assigning work for all fast and slow nodes according to their speeds and without needing to re-distribute data. We implement an LSTM-based speed prediction algorithm to predict speeds of compute nodes. We evaluate $S^2C^2$ on linear algebraic algorithms, gradient descent, graph ranking, and graph filtering algorithms. We demonstrate 19% to 39% reduction in total computation latency using $S^2C^2$ compared to job replication and coded computation. We further show how $S^2C^2$ can be applied beyond matrix-vector multiplication.
Submitted 31 August, 2019; v1 submitted 15 April, 2019;
originally announced April 2019.
-
GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training
Authors:
Mingchao Yu,
Zhifeng Lin,
Krishna Narra,
Songze Li,
Youjie Li,
Nam Sung Kim,
Alexander Schwing,
Murali Annavaram,
Salman Avestimehr
Abstract:
Data parallelism can boost the training speed of convolutional neural networks (CNN), but could suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques have been developed to compress the gradients. But these techniques could perform poorly when used together with decentralized aggregation protocols like ring all-reduce (RAR), mainly due to their inability to directly aggregate compressed gradients. In this paper, we empirically demonstrate the strong linear correlations between CNN gradients, and propose a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction. GradiVeQ enables direct aggregation of compressed gradients, hence allows us to build a distributed learning system that parallelizes GradiVeQ gradient compression and RAR communications. Extensive experiments on popular CNNs demonstrate that applying GradiVeQ slashes the wall-clock gradient aggregation time of the original RAR by more than 5X without noticeable accuracy loss, and reduces the end-to-end training time by almost 50%. The results also show that GradiVeQ is compatible with scalar quantization techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up gain under the same compression ratio.
Submitted 31 December, 2018; v1 submitted 8 November, 2018;
originally announced November 2018.
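The GradiVeQ abstract above notes that its compression allows compressed gradients to be aggregated directly. A minimal sketch of why a linear projection has this property follows. It is not the GradiVeQ code: the orthonormal `BASIS` is hard-coded here for illustration, whereas the paper obtains its basis through PCA, and the function names are hypothetical.

```python
import math

# Assumption: a fixed 2-vector orthonormal basis standing in for PCA components.
BASIS = [
    [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0, 0.0],
    [0.0, 0.0, 1 / math.sqrt(2), 1 / math.sqrt(2)],
]


def compress(grad):
    # Project a 4-dimensional gradient onto the basis: 4 numbers become 2.
    return [sum(b * g for b, g in zip(row, grad)) for row in BASIS]


def decompress(code):
    # Map a 2-dimensional code back to the 4-dimensional gradient space.
    dim = len(BASIS[0])
    return [sum(code[k] * BASIS[k][i] for k in range(len(BASIS))) for i in range(dim)]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


g1 = [1.0, 1.0, 2.0, 2.0]      # gradient held by worker 1
g2 = [3.0, 3.0, -1.0, -1.0]    # gradient held by worker 2
sum_of_codes = add(compress(g1), compress(g2))  # aggregate in compressed form
code_of_sum = compress(add(g1, g2))             # compress the aggregate
recovered = decompress(sum_of_codes)
```

Because projection is linear, adding the codes gives the same result as compressing the summed gradient, so workers in a ring all-reduce can exchange and add the short codes without decompressing at every hop. The example gradients lie exactly in the span of the basis, so `recovered` matches `g1 + g2` up to floating-point rounding; real gradients would incur some projection error.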