Showing 1–50 of 71 results for author: Papailiopoulos, D

Searching in archive cs.
  1. arXiv:2506.09251  [pdf, ps, other]

    cs.CL cs.AI

    Extrapolation by Association: Length Generalization Transfer in Transformers

    Authors: Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos

    Abstract: Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization--the ability to extrapolate from shorter to longer inputs--through the lens of \textit{task association}. We find that length generalization can be \textit{tr…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 23 pages, 20 figures

  2. arXiv:2504.21318  [pdf, other]

    cs.AI cs.CL

    Phi-4-reasoning Technical Report

    Authors: Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing Zheng

    Abstract: We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of "teachable" prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectivel…

    Submitted 30 April, 2025; originally announced April 2025.

  3. arXiv:2502.06737  [pdf, ps, other]

    cs.LG

    VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

    Authors: Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in o…

    Submitted 26 June, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

  4. arXiv:2502.01612  [pdf, other]

    cs.LG cs.AI

    Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges

    Authors: Nayoung Lee, Ziyang Cai, Avi Schwarzschild, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulat…

    Submitted 13 February, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: Added references

  5. arXiv:2501.09240  [pdf, other]

    cs.LG

    Task Vectors in In-Context Learning: Emergence, Formation, and Benefit

    Authors: Liu Yang, Ziqian Lin, Kangwook Lee, Dimitris Papailiopoulos, Robert Nowak

    Abstract: In-context learning is a remarkable capability of transformers, referring to their ability to adapt to specific tasks based on a short history or context. Previous research has found that task-specific information is locally encoded within models, though their emergence and functionality remain unclear due to opaque pre-training processes. In this work, we investigate the formation of task vectors…

    Submitted 15 January, 2025; originally announced January 2025.

  6. arXiv:2412.08890  [pdf, other]

    cs.LG

    Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

    Authors: Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos

    Abstract: We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that key-value cache in modern LLMs can be accurately approximated using sparse linear combination from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks and models. Using orthogonal matching pursui…

    Submitted 11 December, 2024; originally announced December 2024.

    Comments: 18 pages, 7 figures

  7. arXiv:2410.05603  [pdf, other]

    cs.LG cs.AI cs.CL

    Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

    Authors: Zheyang Xiong, Ziyang Cai, John Cooper, Albert Ge, Vasilis Papageorgiou, Zack Sifakis, Angeliki Giannou, Ziqian Lin, Liu Yang, Saurabh Agarwal, Grigorios G Chrysos, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term "task superposition". We provide empirical evidence of this phenomenon across various LLM families and sc…

    Submitted 7 October, 2024; originally announced October 2024.

  8. arXiv:2406.19292  [pdf, other]

    cs.LG cs.AI cs.CL

    From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

    Authors: Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B dem…

    Submitted 13 October, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

  9. arXiv:2403.08058  [pdf, other]

    cs.LG cs.CL

    CHAI: Clustered Head Attention for Efficient LLM Inference

    Authors: Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu

    Abstract: Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of Gigabytes of memory. Multi-Head Attention is one of the key components of LLMs, which can account for over 50% of LLMs memory and comput…

    Submitted 27 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  10. arXiv:2403.03183  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    How Well Can Transformers Emulate In-context Newton's Method?

    Authors: Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

    Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into its underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression. In this work, we study whether Transformers can perform higher orde…

    Submitted 5 March, 2024; originally announced March 2024.

  11. arXiv:2402.04248  [pdf, other]

    cs.LG

    Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

    Authors: Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of mo…

    Submitted 25 April, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: Changes in v2: experiments on formal language ICL and explorations of width vs. depth on ICL; code repo available (24 pages, 10 figures)

  12. arXiv:2311.12424  [pdf, other]

    cs.LG cs.NE

    Looped Transformers are Better at Learning Learning Algorithms

    Authors: Liu Yang, Kangwook Lee, Robert Nowak, Dimitris Papailiopoulos

    Abstract: Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al. However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms, which are commonly employed in traditional machine learning methods. To address this, we propose the utili…

    Submitted 16 March, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted for publication at ICLR 2024

  13. arXiv:2307.05908  [pdf, other]

    cs.CL cs.LG

    Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

    Authors: Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding. Unlike conventional strategies, PPD employs additional compute resources to parallelize the initiation of subsequent token decoding during the current token decoding. This method reduces decoding late…

    Submitted 29 July, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: ES-FoMo Workshop at ICML 2023 / Published in TMLR

  14. arXiv:2307.05906  [pdf, other]

    cs.LG

    Mini-Batch Optimization of Contrastive Loss

    Authors: Jaewoong Cho, Kartik Sreenivasan, Keon Lee, Kyunghoo Mun, Soheun Yi, Jeong-Gwan Lee, Anna Lee, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Contrastive learning has gained significant attention as a method for self-supervised learning. The contrastive loss function ensures that embeddings of positive sample pairs (e.g., different samples from the same class or different views of the same object) are similar, while embeddings of negative pairs are dissimilar. Practical constraints such as large memory requirements make it challenging t…

    Submitted 12 July, 2023; originally announced July 2023.

  15. arXiv:2307.03381  [pdf, other]

    cs.LG

    Teaching Arithmetic to Small Transformers

    Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as add…

    Submitted 7 July, 2023; originally announced July 2023.

  16. arXiv:2305.18869  [pdf, other]

    cs.LG cs.AI cs.CL

    Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning

    Authors: Yingcong Li, Kartik Sreenivasan, Angeliki Giannou, Dimitris Papailiopoulos, Samet Oymak

    Abstract: Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. In an attempt to shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple to study, yet general family of compositi…

    Submitted 7 November, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted for NeurIPS 2023. Changes in this version: refined title, restructured content, included new out-of-distribution experiments, and code now available

  17. Prompted LLMs as Chatbot Modules for Long Open-domain Conversation

    Authors: Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning. Our method utilizes pre-trained large language models (LLMs) as individual modules for long-term consistency and flexibility, by using techniques such as few-shot prompting, chain-of-thought (CoT), and external memory. Our human evaluation resul…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: Accepted to the Findings of ACL2023. The camera-ready version with additional experimental results will be uploaded

  18. arXiv:2305.02538  [pdf, other]

    cs.LG

    Cuttlefish: Low-Rank Model Training without All the Tuning

    Authors: Hongyi Wang, Saurabh Agarwal, Pongsakorn U-chupala, Yoshiki Tanaka, Eric P. Xing, Dimitris Papailiopoulos

    Abstract: Recent research has shown that training low-rank neural networks can effectively reduce the total number of trainable parameters without sacrificing predictive accuracy, resulting in end-to-end speedups. However, low-rank model training necessitates adjusting several additional factorization hyperparameters, such as the rank of the factorization at each layer. In this paper, we tackle this challen…

    Submitted 5 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: Accepted for presentation at MLSys 2023

  19. arXiv:2302.07937  [pdf, other]

    cs.LG cs.AI stat.ML

    The Expressive Power of Tuning Only the Normalization Layers

    Authors: Angeliki Giannou, Shashank Rajput, Dimitris Papailiopoulos

    Abstract: Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks. Recent studies on fine-tuning large pretrained models indicate that just tuning the parameters of these affine transforms can achieve high accuracy for downstream tasks. These findings open the questions about the expressive power of tuning the norm…

    Submitted 4 July, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

  20. arXiv:2301.13196  [pdf, other]

    cs.LG cs.AI

    Looped Transformers as Programmable Computers

    Authors: Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

    Abstract: We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data read/writes. We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, non-linear functions, fu…

    Submitted 30 January, 2023; originally announced January 2023.

  21. arXiv:2301.07067  [pdf, other]

    cs.LG cs.CL stat.ML

    Transformers as Algorithms: Generalization and Stability in In-context Learning

    Authors: Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak

    Abstract: In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. In this work, we formalize in-context learning as an algorithm learning problem where a transformer model implicitly constructs a hypothesis function at inference-time. We first explore the statistical aspects of this abstraction through t…

    Submitted 6 February, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

    Comments: Revised version significantly improves the stability guarantees and provides new experiments

  22. arXiv:2210.03069  [pdf, other]

    cs.LG

    PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks

    Authors: Liu Yang, Jifan Zhang, Joseph Shenouda, Dimitris Papailiopoulos, Kangwook Lee, Robert D. Nowak

    Abstract: Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional to the sum of squared weights. This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective. For neural networks w…

    Submitted 5 July, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

  23. arXiv:2206.06565  [pdf, other]

    cs.LG cs.CL

    LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks

    Authors: Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Fine-tuning pretrained language models (LMs) without making any architectural changes has become a norm for learning various language downstream tasks. However, for non-language downstream tasks, a common practice is to employ task-specific designs for input, output layers, and loss functions. For instance, it is possible to fine-tune an LM into an MNIST classifier by replacing the word embedding…

    Submitted 30 October, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

    Comments: Accepted at NeurIPS 2022

  24. arXiv:2205.11616  [pdf, other]

    cs.CL cs.LG

    Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment

    Authors: Tuan Dinh, Jy-yong Sohn, Shashank Rajput, Timothy Ossowski, Yifei Ming, Junjie Hu, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent findings have shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by making use of visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations b…

    Submitted 7 November, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings)

  25. arXiv:2202.12002  [pdf, other]

    cs.LG cs.AI cs.CV

    Rare Gems: Finding Lottery Tickets at Initialization

    Authors: Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by following a time-consuming "train, prune, re-train" approach. Frankle & Carbin conjecture that we can avoid this by training "lottery tickets", i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy. However, a subsequent line of work by Frankle e…

    Submitted 2 June, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

  26. arXiv:2201.02354  [pdf, other]

    cs.LG

    GenLabel: Mixup Relabeling using Generative Models

    Authors: Jy-yong Sohn, Liang Shang, Hongxu Chen, Jaekyun Moon, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Mixup is a data augmentation method that generates new data points by mixing a pair of input data. While mixup generally improves the prediction performance, it sometimes degrades the performance. In this paper, we first identify the main causes of this phenomenon by theoretically and empirically analyzing the mixup algorithm. To resolve this, we propose GenLabel, a simple yet effective relabeling…

    Submitted 7 January, 2022; originally announced January 2022.

  27. arXiv:2110.08996  [pdf, other]

    cs.LG cs.AI

    Finding Everything within Random Binary Networks

    Authors: Kartik Sreenivasan, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos

    Abstract: A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks. A follow-up line of theoretical work provides justification of these findings by proving that slightly overparameterized neural networks, with commonly used cont…

    Submitted 22 October, 2021; v1 submitted 17 October, 2021; originally announced October 2021.

  28. arXiv:2106.07724  [pdf, other]

    cs.LG cs.IT stat.ML

    An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks

    Authors: Shashank Rajput, Kartik Sreenivasan, Dimitris Papailiopoulos, Amin Karbasi

    Abstract: It is well known that modern deep neural networks are powerful enough to memorize datasets even when the labels have been randomized. Recently, Vershynin (2020) settled a long standing question by Baum (1988), proving that \emph{deep threshold} networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and…

    Submitted 14 June, 2021; originally announced June 2021.

  29. arXiv:2103.03936  [pdf, other]

    cs.LG

    Pufferfish: Communication-efficient Models At No Extra Cost

    Authors: Hongyi Wang, Saurabh Agarwal, Dimitris Papailiopoulos

    Abstract: To mitigate communication overheads in distributed model training, several studies propose the use of compressed stochastic gradients, usually achieved by sparsification or quantization. Such techniques achieve high compression ratios, but in many cases incur either significant computational overheads or some accuracy loss. In this work, we present Pufferfish, a communication and computation effic…

    Submitted 5 March, 2021; originally announced March 2021.

    Comments: Accepted by MLSys 2021

  30. arXiv:2103.00543  [pdf, other]

    cs.DC cs.LG

    On the Utility of Gradient Compression in Distributed Training Systems

    Authors: Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, Dimitris Papailiopoulos

    Abstract: A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent work proposes gradient and model compression methods. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD acr…

    Submitted 29 June, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

  31. arXiv:2102.09718  [pdf, other]

    cs.LG math.OC stat.ML

    Permutation-Based SGD: Is Random Optimal?

    Authors: Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: A recent line of ground-breaking results for permutation-based SGD has corroborated a widely observed phenomenon: random permutations offer faster convergence than with-replacement sampling. However, is random optimal? We show that this depends heavily on what functions we are optimizing, and the convergence gap between optimal and random permutations can vary from exponential to nonexistent. We f…

    Submitted 24 November, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

  32. arXiv:2010.16248  [pdf, other]

    cs.LG

    Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

    Authors: Saurabh Agarwal, Hongyi Wang, Kangwook Lee, Shivaram Venkataraman, Dimitris Papailiopoulos

    Abstract: Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes. To alleviate these bottlenecks, practitioners use gradient compression techniques like sparsification, quantization, or low-rank updates. The techniques usually require choosing a static compression ratio, often requiring users to balance the trade-off between model acc…

    Submitted 29 October, 2020; originally announced October 2020.

  33. arXiv:2007.05084  [pdf, other]

    cs.LG cs.CR cs.DC stat.ML

    Attack of the Tails: Yes, You Really Can Backdoor Federated Learning

    Authors: Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Due to its decentralized nature, Federated Learning (FL) lends itself to adversarial attacks in the form of backdoors during training. The goal of a backdoor is to corrupt the performance of the trained model on specific sub-tasks (e.g., by classifying green cars as frogs). A range of FL backdoor attacks have been introduced in the literature, but also methods to defend against them, and it is cur…

    Submitted 9 July, 2020; originally announced July 2020.

  34. arXiv:2006.07990  [pdf, other]

    cs.LG cs.IT stat.ML

    Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient

    Authors: Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos

    Abstract: The strong {\it lottery ticket hypothesis} (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$, by pruning a random…

    Submitted 11 March, 2021; v1 submitted 14 June, 2020; originally announced June 2020.

  35. arXiv:2002.10400  [pdf, other]

    cs.LG math.OC stat.ML

    Closing the convergence gap of SGD without replacement

    Authors: Shashank Rajput, Anant Gupta, Dimitris Papailiopoulos

    Abstract: Stochastic gradient descent without replacement sampling is widely used in practice for model training. However, the vast majority of SGD analyses assumes data is sampled with replacement, and when the function minimized is strongly convex, an $\mathcal{O}\left(\frac{1}{T}\right)$ rate can be established when SGD is run for $T$ iterations. A recent line of breakthrough works on SGD without replace…

    Submitted 9 July, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

    Comments: Simplified some proofs and fixed typos

  36. arXiv:2002.06440  [pdf, other]

    cs.LG stat.ML

    Federated Learning with Matched Averaging

    Authors: Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, Yasaman Khazaeni

    Abstract: Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud. We propose Federated matched averaging (FedMA) algorithm designed for federated learning of modern neural network architectures e.g. convolutional neural networks (CNNs) and LSTMs. FedMA c…

    Submitted 15 February, 2020; originally announced February 2020.

    Comments: Accepted by ICLR 2020

  37. arXiv:1907.12205  [pdf, other]

    cs.LG cs.DC stat.ML

    DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

    Authors: Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

    Abstract: To improve the resilience of distributed training to worst-case, or Byzantine node failures, several recent approaches have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and only have limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but…

    Submitted 7 March, 2020; v1 submitted 29 July, 2019; originally announced July 2019.

  38. arXiv:1906.02613  [pdf, other]

    cs.LG stat.ML

    Bad Global Minima Exist and SGD Can Reach Them

    Authors: Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas

    Abstract: Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization. We take a careful look at this explanation in the context of image class…

    Submitted 22 February, 2021; v1 submitted 6 June, 2019; originally announced June 2019.

  39. arXiv:1905.09209  [pdf, other]

    cs.LG math.OC stat.ML

    Convergence and Margin of Adversarial Training on Separable Data

    Authors: Zachary Charles, Shashank Rajput, Stephen Wright, Dimitris Papailiopoulos

    Abstract: Adversarial training is a technique for training robust machine learning models. To encourage robustness, it iteratively computes adversarial examples for the model, and then re-trains on these examples via some update rule. This work analyzes the performance of adversarial training on linearly separable data, and provides bounds on the number of iterations required for large margin. We show that…

    Submitted 22 May, 2019; originally announced May 2019.

  40. arXiv:1905.03177  [pdf, other]

    cs.LG stat.ML

    Does Data Augmentation Lead to Positive Margin?

    Authors: Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, Dimitris Papailiopoulos

    Abstract: Data augmentation (DA) is commonly used during model training, as it significantly improves test error and model robustness. DA artificially expands the training set by applying random noise, rotations, crops, or even adversarial perturbations to the input data. Although DA is widely used, its capacity to provably improve robustness is not fully understood. In this work, we analyze the robustness…

    Submitted 8 May, 2019; originally announced May 2019.

    Comments: ICML 2019

  41. arXiv:1904.03257  [pdf, ps, other]

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  42. arXiv:1901.09671  [pdf, other]

    cs.LG cs.DC cs.IT math.OC stat.ML

    ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

    Authors: Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

    Abstract: We present ErasureHead, a new approach for distributed gradient descent (GD) that mitigates system delays by employing approximate gradient coding. Gradient coded distributed GD uses redundancy to exactly recover the gradient at each iteration from a subset of compute nodes. ErasureHead instead uses approximate gradient codes to recover an inexact gradient at each iteration, but with higher delay…

    Submitted 28 January, 2019; originally announced January 2019.

  43. arXiv:1811.03531  [pdf, other]

    cs.LG stat.ML

    A Geometric Perspective on the Transferability of Adversarial Directions

    Authors: Zachary Charles, Harrison Rosenberg, Dimitris Papailiopoulos

    Abstract: State-of-the-art machine learning models frequently misclassify inputs that have been perturbed in an adversarial manner. Adversarial perturbations generated for a given input and a specific classifier often seem to be effective on other inputs and even different classifiers. In other words, adversarial perturbations seem to transfer between different inputs, models, and even different neural netw…

    Submitted 8 November, 2018; originally announced November 2018.

  44. arXiv:1806.04090  [pdf, other]

    stat.ML cs.DC cs.LG

    ATOMO: Communication-efficient Learning via Atomic Sparsification

    Authors: Hongyi Wang, Scott Sievert, Zachary Charles, Shengchao Liu, Stephen Wright, Dimitris Papailiopoulos

    Abstract: Distributed model training suffers from communication overheads due to frequent gradient updates transmitted between compute nodes. To mitigate these overheads, several studies propose the use of sparsified stochastic gradients. We argue that these are facets of a general sparsification method that can operate on any possible atomic decomposition. Notable examples include element-wise, singular va…

    Submitted 8 November, 2018; v1 submitted 11 June, 2018; originally announced June 2018.

  45. arXiv:1806.03791  [pdf, other]

    stat.ML cs.DC cs.LG math.OC stat.CO

    The Effect of Network Width on the Performance of Large-batch Training

    Authors: Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris

    Abstract: Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards a…

    Submitted 10 June, 2018; originally announced June 2018.

  46. arXiv:1805.10378  [pdf, other]

    stat.ML cs.DC cs.IT cs.LG stat.CO

    Gradient Coding via the Stochastic Block Model

    Authors: Zachary Charles, Dimitris Papailiopoulos

    Abstract: Gradient descent and its many variants, including mini-batch stochastic gradient descent, form the algorithmic foundation of modern large-scale machine learning. Due to the size and scale of modern data, gradient computations are often distributed across multiple compute nodes. Unfortunately, such distributed implementations can face significant delays caused by straggler nodes, i.e., nodes that a…

    Submitted 25 May, 2018; originally announced May 2018.

  47. arXiv:1803.09877  [pdf, other]

    stat.ML cs.DC cs.IT cs.LG cs.NE

    DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

    Authors: Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

    Abstract: Distributed model training is vulnerable to byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To guarantee some form of robustness, recent work suggests using variants of the geometric median as an aggregation rule, in place of gradient averaging. Unfortunately, median-based rules can incur…

    Submitted 21 June, 2018; v1 submitted 26 March, 2018; originally announced March 2018.

    Comments: Accepted by ICML 2018

  48. arXiv:1711.06771  [pdf, other]

    stat.ML cs.DC cs.IT cs.LG stat.CO

    Approximate Gradient Coding via Sparse Random Graphs

    Authors: Zachary Charles, Dimitris Papailiopoulos, Jordan Ellenberg

    Abstract: Distributed algorithms are often beset by the straggler effect, where the slowest compute nodes in the system dictate the overall running time. Coding-theoretic techniques have been recently proposed to mitigate stragglers via algorithmic redundancy. Prior work in coded computation and gradient coding has mainly focused on exact recovery of the desired output. However, slightly inexact solutions c…

    Submitted 17 November, 2017; originally announced November 2017.

  49. arXiv:1710.08402  [pdf, other]

    stat.ML cs.IT cs.LG math.OC

    Stability and Generalization of Learning Algorithms that Converge to Global Optima

    Authors: Zachary Charles, Dimitris Papailiopoulos

    Abstract: We establish novel generalization bounds for learning algorithms that converge to global minima. We do so by deriving black-box stability results that only depend on the convergence of a learning algorithm and the geometry around the minimizers of the loss function. The results are shown for nonconvex loss functions satisfying the Polyak-Łojasiewicz (PL) and the quadratic growth (QG) conditions. W…

    Submitted 23 October, 2017; originally announced October 2017.

    Comments: 27 pages, 5 figures

  50. arXiv:1706.05699  [pdf, other]

    cs.LG cs.DC

    Gradient Diversity: a Key Ingredient for Scalable Distributed Learning

    Authors: Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett

    Abstract: It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size. In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notio…

    Submitted 6 January, 2018; v1 submitted 18 June, 2017; originally announced June 2017.