Showing 1–50 of 171 results for author: Kakade, S

  1. arXiv:2506.15535  [pdf, ps, other]

    cs.LG math.OC stat.ML

    A Simplified Analysis of SGD for Linear Regression with Weight Averaging

    Authors: Alexandru Meterez, Depen Morwani, Costin-Andrei Oncescu, Jingfeng Wu, Cengiz Pehlevan, Sham Kakade

    Abstract: Theoretically understanding stochastic gradient descent (SGD) in overparameterized models has led to the development of several optimization algorithms that are widely used in practice today. Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression using constant learning rate, both with and without tail iterate averaging, based on a bias-variance decompo…

    Submitted 18 June, 2025; originally announced June 2025.

  2. arXiv:2506.10378  [pdf, ps, other]

    cs.LG cs.AI cs.CL stat.ML

    Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

    Authors: Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang

    Abstract: Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation…

    Submitted 12 June, 2025; originally announced June 2025.

  3. arXiv:2504.11695  [pdf, ps, other]

    cs.CV cs.MM

    Interpreting the linear structure of vision-language model embedding spaces

    Authors: Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, Stephanie Gil

    Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2)…

    Submitted 28 May, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

  4. arXiv:2504.07912  [pdf, other]

    cs.LG

    Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

    Authors: Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, Eran Malach

    Abstract: Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning models, recent work has demonstrated that RL fine-tuning consistently improves performance, even in smaller-scale models; however, the underlying mechanisms driving these improvements are not well-unders…

    Submitted 10 April, 2025; originally announced April 2025.

    ACM Class: I.2.7

  5. arXiv:2502.18822  [pdf, other]

    cs.AI cs.MA

    Data-Efficient Multi-Agent Spatial Planning with LLMs

    Authors: Huangyuan Su, Aaron Walsman, Daniel Garces, Sham Kakade, Stephanie Gil

    Abstract: In this project, our goal is to determine how to leverage the world-knowledge of pretrained large language models for efficient and robust learning in multiagent decision making. We examine this in a taxi routing and assignment problem where agents must decide how to best pick up passengers in order to minimize overall waiting time. While this problem is situated on a graphical road network, we sh…

    Submitted 25 February, 2025; originally announced February 2025.

  6. arXiv:2502.17356  [pdf, other]

    cs.LG

    Distributional Scaling for Emergent Capabilities

    Authors: Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra

    Abstract: This paper explores the nature of sudden breakthroughs in language model performance at scale, which stand in contrast to smooth improvements governed by scaling laws. While advocates of "emergence" view breakthroughs as unlocked capabilities, others attribute them to thresholding effects on noncontinuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probab…

    Submitted 27 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: 18 pages

    ACM Class: I.2.7

  7. arXiv:2502.16792  [pdf, other]

    cs.LG cs.AI cs.CL

    The Role of Sparsity for Length Generalization in Transformers

    Authors: Noah Golowich, Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

    Abstract: Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length general…

    Submitted 23 February, 2025; originally announced February 2025.

  8. arXiv:2502.06768  [pdf, other]

    cs.LG

    Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions

    Authors: Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, Sitan Chen

    Abstract: In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode…

    Submitted 5 March, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

  9. arXiv:2502.02431  [pdf, other]

    cs.LG cs.AI

    Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

    Authors: Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade

    Abstract: Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS and Lion which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In th…

    Submitted 4 February, 2025; originally announced February 2025.

  10. arXiv:2501.05559  [pdf, other]

    cs.LG cs.AI

    Soup to go: mitigating forgetting during continual learning with model averaging

    Authors: Anat Kleiman, Gintare Karolina Dziugaite, Jonathan Frankle, Sham Kakade, Mansheej Paul

    Abstract: In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expenses? Inspired by other mergi…

    Submitted 9 January, 2025; originally announced January 2025.

  11. arXiv:2412.07770  [pdf, other]

    cs.CV cs.LG

    From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos

    Authors: Matthew Wallingford, Anand Bhattad, Aditya Kusupati, Vivek Ramanujan, Matt Deitke, Sham Kakade, Aniruddha Kembhavi, Roozbeh Mottaghi, Wei-Chiu Ma, Ali Farhadi

    Abstract: Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large scale synthetic and object-centric 3D datasets have been shown to be effective in training models that have 3D understanding of objects. However, applying a similar approach to real-world object…

    Submitted 10 December, 2024; originally announced December 2024.

    Comments: NeurIPS 2024. For project page, see https://mattwallingford.github.io/ODIN

  12. arXiv:2412.02674  [pdf, other]

    cs.CL cs.LG

    Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

    Authors: Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, Udaya Ghai

    Abstract: Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights data based on this verification, and distills the filtered data. Despite several empirical successes, a fundamental understanding is still lacking. In this work, we initiate a comprehensive, modular…

    Submitted 25 February, 2025; v1 submitted 3 December, 2024; originally announced December 2024.

    Comments: ICLR 2025; 41 pages, 19 figures

  13. arXiv:2411.12925  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Loss-to-Loss Prediction: Scaling Laws for All Datasets

    Authors: David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade

    Abstract: While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task d…

    Submitted 19 November, 2024; originally announced November 2024.

  14. arXiv:2410.21676  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    How Does Critical Batch Size Scale in Pre-training?

    Authors: Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade

    Abstract: Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the compromise between time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive…

    Submitted 21 April, 2025; v1 submitted 28 October, 2024; originally announced October 2024.

    Comments: ICLR 2025, Blog post: https://kempnerinstitute.harvard.edu/research/deeper-learning/how-does-critical-batch-size-scale-in-pre-training-decoupling-data-and-model-size

  15. arXiv:2410.19034  [pdf, other]

    cs.LG

    Mixture of Parrots: Experts improve memorization more than reasoning

    Authors: Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach

    Abstract: The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), the memorization perform…

    Submitted 28 February, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

  16. arXiv:2410.13025  [pdf, other]

    cs.CL cs.LG

    LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks

    Authors: Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, Samy Jelassi

    Abstract: Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning of Large Language Models (LLMs). We study how different LoRA modules can be merged to achieve skill composition -- testing the performance of the merged model on a target task that involves combining multiple skills, each skill coming from a single LoRA. This setup is favorable when it is difficult to obtain trai…

    Submitted 2 December, 2024; v1 submitted 16 October, 2024; originally announced October 2024.

    Comments: COLING 2025 Industry track; 9 pages plus references and appendices

  17. arXiv:2410.12982  [pdf, other]

    cs.LG cs.AI

    Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

    Authors: Costin-Andrei Oncescu, Sanket Purandare, Stratos Idreos, Sham Kakade

    Abstract: While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during i…

    Submitted 16 October, 2024; originally announced October 2024.

    Comments: 15 pages, 9 figures, 5 algorithms

  18. arXiv:2410.02817  [pdf, other]

    eess.SY cs.LG stat.ML

    Neural Coordination and Capacity Control for Inventory Management

    Authors: Carson Eisenach, Udaya Ghai, Dhruv Madeka, Kari Torkkola, Dean Foster, Sham Kakade

    Abstract: This paper addresses the capacitated periodic review inventory control problem, focusing on a retailer managing multiple products with limited shared resources, such as storage or inbound labor at a facility. Specifically, this paper is motivated by the questions of (1) what does it mean to backtest a capacity control mechanism, (2) can we devise and backtest a capacity control mechanism that is c…

    Submitted 24 September, 2024; originally announced October 2024.

  19. arXiv:2409.11321  [pdf, other]

    cs.LG cs.AI

    SOAP: Improving and Stabilizing Shampoo using Adam

    Authors: Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade

    Abstract: There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implem…

    Submitted 31 January, 2025; v1 submitted 17 September, 2024; originally announced September 2024.

  20. arXiv:2409.00717  [pdf, other]

    cs.LG cs.AI cs.GT cs.MA

    Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques

    Authors: Natalia Zhang, Xinqi Wang, Qiwen Cui, Runlong Zhou, Sham M. Kakade, Simon S. Du

    Abstract: We initiate the study of Preference-Based Multi-Agent Reinforcement Learning (PbMARL), exploring both theoretical foundations and empirical validations. We define the task as identifying the Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibri…

    Submitted 9 January, 2025; v1 submitted 1 September, 2024; originally announced September 2024.

    Comments: 9 pages

  21. arXiv:2407.07972  [pdf, other]

    cs.LG cs.AI

    Deconstructing What Makes a Good Optimizer for Language Models

    Authors: Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade

    Abstract: Training language models becomes increasingly expensive with scale, prompting numerous attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer remains the most widely used, due to a prevailing view that it is the most effective approach. We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, Lion, and Sophia in the context of autoregressiv…

    Submitted 27 February, 2025; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: 21 pages, ICLR 2025

  22. arXiv:2407.03310  [pdf, other]

    cs.LG

    Universal Length Generalization with Turing Programs

    Authors: Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, Eran Malach

    Abstract: Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed some architecture or data format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we…

    Submitted 3 July, 2024; originally announced July 2024.

  23. arXiv:2407.01100  [pdf, other]

    cs.CL cs.LG

    Eliminating Position Bias of Language Models: A Mechanistic Approach

    Authors: Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, Heng Ji

    Abstract: Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of…

    Submitted 31 March, 2025; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: 26 pages, 6 figures, 15 tables

  24. arXiv:2406.17748  [pdf, other]

    cs.LG math.OC stat.ML

    A New Perspective on Shampoo's Preconditioner

    Authors: Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

    Abstract: Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss--Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connec…

    Submitted 25 June, 2024; originally announced June 2024.

  25. arXiv:2406.11794  [pdf, other]

    cs.LG cs.CL

    DataComp-LM: In search of the next generation of training sets for language models

    Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, et al. (34 additional authors not shown)

    Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with dat…

    Submitted 21 April, 2025; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Project page: https://www.datacomp.ai/dclm/

  26. arXiv:2406.11741  [pdf, other]

    cs.LG cs.AI

    Transcendence: Generative Models Can Outperform The Experts That Train Them

    Authors: Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach

    Abstract: Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities…

    Submitted 12 October, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Code, models, and data at https://transcendence.eddie.win

  27. arXiv:2406.10670  [pdf, other]

    cs.LG cs.AI cs.CL

    CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

    Authors: David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, Sham Kakade

    Abstract: Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an em…

    Submitted 29 October, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

  28. arXiv:2406.08466  [pdf, ps, other]

    cs.LG cs.AI math.ST stat.ML

    Scaling Laws in Linear Regression: Compute, Parameters, and Data

    Authors: Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

    Abstract: Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, wh…

    Submitted 10 June, 2025; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: fixed typos

  29. arXiv:2405.18400  [pdf, other]

    cs.CL cs.LG

    Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

    Authors: Ethan Shen, Alan Fan, Sarah M. Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati

    Abstract: Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To…

    Submitted 30 October, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: 23 pages, 16 figures, accepted at NeurIPS 2024

  30. arXiv:2404.12376  [pdf, other]

    cs.LG math.OC stat.ML

    Matching the Statistical Query Lower Bound for $k$-Sparse Parity Problems with Sign Stochastic Gradient Descent

    Authors: Yiwen Kou, Zixiang Chen, Quanquan Gu, Sham M. Kakade

    Abstract: The $k$-sparse parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-sparse parity problem with sign stochastic gradient descent, a variant of stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that this approach can eff…

    Submitted 5 December, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

    Comments: 37 pages, 7 figures, 3 tables. In NeurIPS 2024

  31. arXiv:2402.17840  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

    Authors: Zhenting Qi, Hanlin Zhang, Eric Xing, Sham Kakade, Himabindu Lakkaraju

    Abstract: Retrieval-Augmented Generation (RAG) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs). We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with ins…

    Submitted 6 October, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

  32. arXiv:2402.14688  [pdf, other]

    cs.LG

    Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

    Authors: Kenneth Li, Samy Jelassi, Hugh Zhang, Sham Kakade, Martin Wattenberg, David Brandfonbrener

    Abstract: We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candid…

    Submitted 2 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

  33. arXiv:2402.01032  [pdf, other]

    cs.LG cs.AI cs.CL

    Repeat After Me: Transformers are Better than State Space Models at Copying

    Authors: Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

    Abstract: Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks th…

    Submitted 3 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  34. arXiv:2312.04021  [pdf, other]

    cs.CL cs.AI cs.LG

    A Study on the Calibration of In-context Learning

    Authors: Hanlin Zhang, Yi-Fan Zhang, Yaodong Yu, Dhruv Madeka, Dean Foster, Eric Xing, Himabindu Lakkaraju, Sham Kakade

    Abstract: Accurate uncertainty quantification is crucial for the safe deployment of machine learning models, and prior research has demonstrated improvements in the calibration of modern language models (LMs). We study in-context learning (ICL), a prevalent method for adapting static LMs through tailored prompts, and examine the balance between performance and calibration across a broad spectrum of natural…

    Submitted 27 March, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: NAACL 2024

  35. arXiv:2311.07568  [pdf, other]

    cs.LG

    Feature emergence via margin maximization: case studies in algebraic tasks

    Authors: Depen Morwani, Benjamin L. Edelman, Costin-Andrei Oncescu, Rosie Zhao, Sham Kakade

    Abstract: Understanding the internal representations learned by neural networks is a cornerstone challenge in the science of machine learning. While there have been significant recent strides in some cases towards understanding how neural networks implement specific target functions, this paper explores a complementary question -- why do networks arrive at particular computational strategies? Our inquiry fo…

    Submitted 19 February, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: Accepted as Spotlight at ICLR 2024

    ACM Class: I.5.1; I.2.6

  36. arXiv:2310.17168  [pdf, other]

    cs.LG stat.ML

    Learning an Inventory Control Policy with General Inventory Arrival Dynamics

    Authors: Sohrab Andaz, Carson Eisenach, Dhruv Madeka, Kari Torkkola, Randy Jia, Dean Foster, Sham Kakade

    Abstract: In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics -- which we term as a quantity-over-time arrivals model (QOT). We also allow for order quantities to be modified as a post-processing step to meet vendor constraints such as order minimum and batch size constraints -- a common practice in real supply chains. To th…

    Submitted 21 January, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

  37. arXiv:2310.07707  [pdf, other]

    cs.LG cs.CL cs.CV

    MatFormer: Nested Transformer for Elastic Inference

    Authors: Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain

    Abstract: Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not b…

    Submitted 14 December, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

    Comments: 30 pages, 11 figures, first three authors contributed equally. NeurIPS, 2024

  38. arXiv:2309.03800  [pdf, other]

    cs.LG cs.AI stat.ML

    Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck

    Authors: Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

    Abstract: In modern deep learning, algorithmic choices (such as width, depth, and learning rate) are known to modulate nuanced resource tradeoffs. This work investigates how these complexities necessarily arise for feature learning in the presence of computational-statistical gaps. We begin by considering offline sparse parity learning, a supervised classification problem which admits a statistical query lo…

    Submitted 30 October, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: v2: NeurIPS 2023 camera-ready updates

  39. arXiv:2307.09423  [pdf, other]

    cs.LG cs.AI stat.ML

    Scaling Laws for Imitation Learning in Single-Agent Games

    Authors: Jens Tuyls, Dhruv Madeka, Kari Torkkola, Dean Foster, Karthik Narasimhan, Sham Kakade

    Abstract: Imitation Learning (IL) is one of the most widely used methods in machine learning. Yet, many works find it is often unable to fully recover the underlying expert behavior, even in constrained environments like single-agent games. However, none of these works deeply investigate the role of scaling up the model and data size. Inspired by recent work in Natural Language Processing (NLP) where "scali…

    Submitted 19 December, 2024; v1 submitted 18 July, 2023; originally announced July 2023.

    Comments: Accepted at TMLR 2024

  40. arXiv:2306.08590  [pdf, other]

    cs.LG stat.ML

    Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

    Authors: Nikhil Vyas, Depen Morwani, Rosie Zhao, Gal Kaplun, Sham Kakade, Boaz Barak

    Abstract: The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by finite batch sizes ("SGD noise"). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not…

    Submitted 7 June, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

  41. arXiv:2305.19435  [pdf, other]

    cs.LG cs.IR

    AdANNS: A Framework for Adaptive Semantic Search

    Authors: Aniket Rege, Aditya Kusupati, Sharan Ranjit S, Alan Fan, Qingqing Cao, Sham Kakade, Prateek Jain, Ali Farhadi

    Abstract: Web-scale search systems learn an encoder to embed a given query which is then hooked into an approximate nearest neighbor search (ANNS) pipeline to retrieve similar data points. To accurately capture tail queries and data points, learned representations typically are rigid, high-dimensional vectors that are generally used as-is in the entire ANNS pipeline and can lead to computationally expensive…

    Submitted 18 October, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: 25 pages, 15 figures. NeurIPS 2023 camera ready publication

  42. arXiv:2305.10634  [pdf, other]

    math.OC cs.LG

    Modified Gauss-Newton Algorithms under Noise

    Authors: Krishna Pillutla, Vincent Roulet, Sham Kakade, Zaid Harchaoui

    Abstract: Gauss-Newton methods and their stochastic version have been widely used in machine learning and signal processing. Their nonsmooth counterparts, modified Gauss-Newton or prox-linear algorithms, can lead to contrasting outcomes when compared to gradient descent in large-scale statistical settings. We explore the contrasting performance of these two classes of algorithms in theory on a stylized stat…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: IEEE SSP 2023

  43. arXiv:2303.12287  [pdf, ps, other]

    cs.LG cs.AI cs.GT stat.ML

    Hardness of Independent Learning and Sparse Equilibrium Computation in Markov Games

    Authors: Dylan J. Foster, Noah Golowich, Sham M. Kakade

    Abstract: We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results in normal-form games. While recent work has shown that such algorithms exist for restric…

    Submitted 21 March, 2023; originally announced March 2023.

    Comments: 51 pages

  44. arXiv:2303.02255  [pdf, other]

    cs.LG math.OC stat.ML

    Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron

    Authors: Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: This paper considers the problem of learning a single ReLU neuron with squared loss (a.k.a., ReLU regression) in the overparameterized regime, where the input dimension can exceed the number of samples. We analyze a Perceptron-type algorithm called GLM-tron (Kakade et al., 2011) and provide its dimension-free risk upper bounds for high-dimensional ReLU regression in both well-specified and misspec…

    Submitted 26 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: ICML 2023 camera ready

  45. arXiv:2302.14753  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning Hidden Markov Models Using Conditional Samples

    Authors: Sham M. Kakade, Akshay Krishnamurthy, Gaurav Mahajan, Cyril Zhang

    Abstract: This paper is concerned with the computational complexity of learning the Hidden Markov Model (HMM). Although HMMs are some of the most widely used tools in sequential and time series modeling, they are cryptographically hard to learn in the standard setting where one has access to i.i.d. samples of observation sequences. In this paper, we depart from this setup and consider an interactive access…

    Submitted 24 February, 2024; v1 submitted 28 February, 2023; originally announced February 2023.

  46. arXiv:2302.10870  [pdf, other]

    cs.LG stat.ML

    On Provable Copyright Protection for Generative Models

    Authors: Nikhil Vyas, Sham Kakade, Boaz Barak

    Abstract: There is a growing concern that learned conditional generative models may output samples that are substantially similar to some copyrighted data $C$ that was in their training set. We give a formal definition of $\textit{near access-freeness (NAF)}$ and prove bounds on the probability that a model satisfying this definition outputs a sample similar to $C$, even if $C$ is included in its training s…

    Submitted 21 July, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

    Comments: Accepted at ICML 2023

  47. arXiv:2210.09579  [pdf, other]

    cs.LG cs.AI

    Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity

    Authors: Abhishek Gupta, Aldo Pacchiano, Yuexiang Zhai, Sham M. Kakade, Sergey Levine

    Abstract: Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications, but in practice the choice of reward function can be crucial for good results -- while in principle the reward only needs to specify what the task is, in reality practitioners often need to design more detailed rewards that provide the agent with some hints about how the task should…

    Submitted 18 October, 2022; originally announced October 2022.

  48. arXiv:2210.04157  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    The Role of Coverage in Online Reinforcement Learning

    Authors: Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, Sham M. Kakade

    Abstract: Coverage conditions -- which assert that the data logging distribution adequately covers the state space -- play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing -- somewhat surprisingly -- that the mere existence of a data…

    Submitted 8 October, 2022; originally announced October 2022.

  49. arXiv:2210.03137  [pdf, other]

    cs.LG math.OC

    Deep Inventory Management

    Authors: Dhruv Madeka, Kari Torkkola, Carson Eisenach, Anna Luo, Dean P. Foster, Sham M. Kakade

    Abstract: This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, our results show that several policy learning approaches are competitive with or outperform classical methods. In order to train…

    Submitted 28 November, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

  50. arXiv:2209.00735  [pdf, other]

    cs.LG stat.ML

    Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms

    Authors: Surbhi Goel, Sham Kakade, Adam Tauman Kalai, Cyril Zhang

    Abstract: Neural networks (NNs) struggle to efficiently solve certain problems, such as learning parities, even when there are simple learning algorithms for those problems. Can NNs discover learning algorithms on their own? We exhibit a NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized program. For example, on parity problems, the N…

    Submitted 15 January, 2023; v1 submitted 1 September, 2022; originally announced September 2022.

    Comments: v2: final camera-ready revisions for NeurIPS 2022