Showing 1–12 of 12 results for author: Morwani, D

  1. arXiv:2506.15535  [pdf, ps, other]

    cs.LG math.OC stat.ML

    A Simplified Analysis of SGD for Linear Regression with Weight Averaging

    Authors: Alexandru Meterez, Depen Morwani, Costin-Andrei Oncescu, Jingfeng Wu, Cengiz Pehlevan, Sham Kakade

    Abstract: Theoretically understanding stochastic gradient descent (SGD) in overparameterized models has led to the development of several optimization algorithms that are widely used in practice today. Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression using a constant learning rate, both with and without tail iterate averaging, based on a bias-variance decompo…

    Submitted 18 June, 2025; originally announced June 2025.

  2. arXiv:2502.02431  [pdf, other]

    cs.LG cs.AI

    Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

    Authors: Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade

    Abstract: Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In th…

    Submitted 4 February, 2025; originally announced February 2025.

  3. arXiv:2410.21676  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    How Does Critical Batch Size Scale in Pre-training?

    Authors: Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade

    Abstract: Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the compromise between time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive…

    Submitted 21 April, 2025; v1 submitted 28 October, 2024; originally announced October 2024.

    Comments: ICLR 2025, Blog post: https://kempnerinstitute.harvard.edu/research/deeper-learning/how-does-critical-batch-size-scale-in-pre-training-decoupling-data-and-model-size

  4. arXiv:2409.11321  [pdf, other]

    cs.LG cs.AI

    SOAP: Improving and Stabilizing Shampoo using Adam

    Authors: Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade

    Abstract: There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implem…

    Submitted 31 January, 2025; v1 submitted 17 September, 2024; originally announced September 2024.

  5. arXiv:2407.07972  [pdf, other]

    cs.LG cs.AI

    Deconstructing What Makes a Good Optimizer for Language Models

    Authors: Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade

    Abstract: Training language models becomes increasingly expensive with scale, prompting numerous attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer remains the most widely used, due to a prevailing view that it is the most effective approach. We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, Lion, and Sophia in the context of autoregressiv…

    Submitted 27 February, 2025; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: 21 pages, ICLR 2025

  6. arXiv:2406.17748  [pdf, other]

    cs.LG math.OC stat.ML

    A New Perspective on Shampoo's Preconditioner

    Authors: Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

    Abstract: Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss-Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connec…

    Submitted 25 June, 2024; originally announced June 2024.

  7. arXiv:2311.07568  [pdf, other]

    cs.LG

    Feature emergence via margin maximization: case studies in algebraic tasks

    Authors: Depen Morwani, Benjamin L. Edelman, Costin-Andrei Oncescu, Rosie Zhao, Sham Kakade

    Abstract: Understanding the internal representations learned by neural networks is a cornerstone challenge in the science of machine learning. While there have been significant recent strides in some cases towards understanding how neural networks implement specific target functions, this paper explores a complementary question -- why do networks arrive at particular computational strategies? Our inquiry fo…

    Submitted 19 February, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: Accepted as Spotlight at ICLR 2024

    ACM Class: I.5.1; I.2.6

  8. arXiv:2306.08590  [pdf, other]

    cs.LG stat.ML

    Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

    Authors: Nikhil Vyas, Depen Morwani, Rosie Zhao, Gal Kaplun, Sham Kakade, Boaz Barak

    Abstract: The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by finite batch sizes ("SGD noise"). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not…

    Submitted 7 June, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

  9. arXiv:2305.18411  [pdf, other]

    cs.LG

    Feature-Learning Networks Are Consistent Across Widths At Realistic Scales

    Authors: Nikhil Vyas, Alexander Atanasov, Blake Bordelon, Depen Morwani, Sabarish Sainathan, Cengiz Pehlevan

    Abstract: We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets. Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training. For simple tasks such as CIFAR-5m this holds throughout training for networks of realistic widths.…

    Submitted 5 December, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: 24 pages, 19 figures. NeurIPS 2023. Revised based on reviewer feedback

  10. arXiv:2302.00457  [pdf, other]

    cs.LG cs.AI stat.ML

    Simplicity Bias in 1-Hidden Layer Neural Networks

    Authors: Depen Morwani, Jatin Batra, Prateek Jain, Praneeth Netrapalli

    Abstract: Recent works have demonstrated that neural networks exhibit extreme simplicity bias (SB). That is, they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST, MNIST-CIFAR where defining feat…

    Submitted 1 February, 2023; originally announced February 2023.

    ACM Class: I.5.1; I.2.6

  11. arXiv:2012.08854  [pdf, ps, other]

    cs.LG stat.ML

    Using noise resilience for ranking generalization of deep neural networks

    Authors: Depen Morwani, Rahul Vashisht, Harish G. Ramaswamy

    Abstract: Recent papers have shown that sufficiently overparameterized neural networks can perfectly fit even random labels. Thus, it is crucial to understand the underlying reason behind the generalization performance of a network on real-world data. In this work, we propose several measures to predict the generalization error of a network given the training data and its parameters. Using one of these meas…

    Submitted 16 December, 2020; originally announced December 2020.

    ACM Class: I.5.1

  12. arXiv:2010.12909  [pdf, other]

    cs.LG stat.ML

    Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets

    Authors: Depen Morwani, Harish G. Ramaswamy

    Abstract: We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyze both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these res…

    Submitted 31 January, 2023; v1 submitted 24 October, 2020; originally announced October 2020.

    Comments: Accepted to ALT 2022

    ACM Class: I.5.1; I.2.6