Showing 1–18 of 18 results for author: Reddi, S J

Searching in archive math.
  1. arXiv:2410.21698  [pdf, other]

    cs.LG math.ST stat.ML

    On the Role of Depth and Looping for In-Context Learning with Task Diversity

    Authors: Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar

    Abstract: The intriguing in-context learning (ICL) abilities of deep Transformer models have lately garnered significant attention. By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that ICL emerges from Transformers' abilities to simulate learning algorithms like gradient descent. However, these works fail to capture the remarkable abilit…

    Submitted 28 October, 2024; originally announced October 2024.

  2. arXiv:2410.16401  [pdf, other]

    cs.LG math.ST stat.ML

    Simplicity Bias via Global Convergence of Sharpness Minimization

    Authors: Khashayar Gatmiry, Zhiyuan Li, Sashank J. Reddi, Stefanie Jegelka

    Abstract: The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folk…

    Submitted 21 October, 2024; originally announced October 2024.

  3. arXiv:2211.03970  [pdf, other]

    cs.LG math.OC

    On the Algorithmic Stability and Generalization of Adaptive Optimization Methods

    Authors: Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabás Póczos

    Abstract: Despite their popularity in deep learning and machine learning in general, the theoretical properties of adaptive optimizers such as Adagrad, RMSProp, Adam or AdamW are not yet fully understood. In this paper, we develop a novel framework to study the stability and generalization of these optimization methods. Based on this framework, we show provable guarantees about such properties that depend h…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: 21 pages including appendix

  4. arXiv:2008.03606  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning

    Authors: Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

    Abstract: Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon. In fact, obtaining an algorithm for FL which is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, Mime, which i) mitigates cl…

    Submitted 8 June, 2021; v1 submitted 8 August, 2020; originally announced August 2020.

    Comments: Version 2 provides stronger theoretical results and more thorough experiments

    MSC Class: 68W40; 68W15; 90C25; 90C06

    ACM Class: G.1.6; F.2.1; E.4

  5. arXiv:2002.08528  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets

    Authors: Ilqar Ramazanli, Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabas Poczos

    Abstract: We study distributed optimization algorithms for minimizing the average of heterogeneous functions distributed across several machines with a focus on communication efficiency. In such settings, naively using the classical stochastic gradient descent (SGD) or its variants (e.g., SVRG) with a uniform sampling of machines typically yields poor performance. It often leads to the dependence of…

    Submitted 17 November, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

  6. arXiv:1912.03194  [pdf, other]

    math.OC cs.LG

    Why are Adaptive Methods Good for Attention Models?

    Authors: Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, Suvrit Sra

    Abstract: While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a h…

    Submitted 23 October, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

  7. arXiv:1910.06378  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

    Authors: Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

    Abstract: Federated Averaging (FedAvg) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow converge…

    Submitted 9 April, 2021; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: v2 contains analysis of FedAvg, non-convex rates of Scaffold, and experimental evaluation. v3 fixes typos, ICML version. v4 slightly improves rate of SCAFFOLD for general convex functions

    MSC Class: 68W40; 68W15; 90C25; 90C06

    ACM Class: G.1.6; F.2.1; E.4

  8. arXiv:1904.09237  [pdf, other]

    cs.LG math.OC stat.ML

    On the Convergence of Adam and Beyond

    Authors: Sashank J. Reddi, Satyen Kale, Sanjiv Kumar

    Abstract: Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to co…

    Submitted 19 April, 2019; originally announced April 2019.

    Comments: Appeared in ICLR 2018

  9. arXiv:1901.09149  [pdf, other]

    cs.LG math.OC stat.ML

    Escaping Saddle Points with Adaptive Gradient Methods

    Authors: Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra

    Abstract: Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own,…

    Submitted 3 February, 2020; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: Update Theorem 4.1 and proof to use martingale concentration bounds, i.e. matrix Freedman

  10. arXiv:1608.06879  [pdf, other]

    math.OC cs.LG stat.ML

    AIDE: Fast and Communication Efficient Distributed Optimization

    Authors: Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczos, Alex Smola

    Abstract: In this paper, we present two new communication-efficient methods for distributed minimization of an average of functions. The first algorithm is an inexact variant of the DANE algorithm that allows any local algorithm to return an approximate solution to a local subproblem. We show that such a strategy does not affect the theoretical guarantees of DANE significantly. In fact, our approach can be…

    Submitted 24 August, 2016; originally announced August 2016.

  11. arXiv:1607.08254  [pdf, other]

    math.OC cs.LG stat.ML

    Stochastic Frank-Wolfe Methods for Nonconvex Optimization

    Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

    Abstract: We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization problems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their ability to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limite…

    Submitted 29 July, 2016; v1 submitted 27 July, 2016; originally announced July 2016.

  12. arXiv:1605.07147  [pdf, other]

    math.OC cs.LG

    Riemannian SVRG: Fast Stochastic Optimization on Riemannian Manifolds

    Authors: Hongyi Zhang, Sashank J. Reddi, Suvrit Sra

    Abstract: We study optimization of finite sums of geodesically smooth functions on Riemannian manifolds. Although variance reduction techniques for optimizing finite-sums have witnessed tremendous attention in the recent years, existing work is limited to vector space problems. We introduce Riemannian SVRG (RSVRG), a new variance reduced Riemannian optimization method. We analyze RSVRG for both geodesically…

    Submitted 7 April, 2017; v1 submitted 23 May, 2016; originally announced May 2016.

    Comments: This is the final version that appeared in NIPS 2016. Our proof of Lemma 2 was incorrect in the previous arXiv version. (9 pages paper + 6 pages appendix)

    Journal ref: Advances in Neural Information Processing Systems 29 (NIPS 2016)

  13. arXiv:1605.06900  [pdf, other]

    math.OC cs.LG stat.ML

    Fast Stochastic Methods for Nonsmooth Nonconvex Optimization

    Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

    Abstract: We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonconvex part is smooth and the nonsmooth part is convex. Surprisingly, unlike the smooth case, our knowledge of this fundamental problem is very limited. For example, it is not known whether the proximal stochastic gradient method with constant minibatch converges to a stationary point. To tackle…

    Submitted 23 May, 2016; originally announced May 2016.

  14. arXiv:1603.06160  [pdf, other]

    math.OC cs.LG cs.NE stat.ML

    Stochastic Variance Reduction for Nonconvex Optimization

    Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

    Abstract: We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary po…

    Submitted 4 April, 2016; v1 submitted 19 March, 2016; originally announced March 2016.

    Comments: Minor feedback changes

  15. arXiv:1603.06159  [pdf, other]

    math.OC cs.LG stat.ML

    Fast Incremental Method for Nonconvex Optimization

    Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

    Abstract: We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$. Specifically, we analyze the SAGA algorithm within an Incremental First-order Oracle framework, and show that it converges to a stationary point provably faster than both gradient descent and stochastic gradient descent. We also discuss a Polyak's special class of nonconve…

    Submitted 19 March, 2016; originally announced March 2016.

  16. arXiv:1508.00655  [pdf, other]

    math.ST cs.AI cs.IT cs.LG stat.ML

    Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

    Authors: Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman

    Abstract: Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), which is about testing for…

    Submitted 4 August, 2015; originally announced August 2015.

    Comments: 35 pages, 4 figures

  17. arXiv:1411.6314  [pdf, other]

    math.ST cs.AI cs.IT cs.LG stat.ML

    On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives

    Authors: Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman

    Abstract: Nonparametric two sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (general alternatives…

    Submitted 23 November, 2014; originally announced November 2014.

    Comments: 25 pages, 5 figures

  18. arXiv:1406.2083  [pdf, other]

    stat.ML cs.IT cs.LG math.ST stat.ME

    On the Decreasing Power of Kernel and Distance based Nonparametric Hypothesis Tests in High Dimensions

    Authors: Sashank J. Reddi, Aaditya Ramdas, Barnabás Póczos, Aarti Singh, Larry Wasserman

    Abstract: This paper is about two related decision theoretic problems, nonparametric two-sample testing and independence testing. There is a belief that two recently proposed solutions, based on kernels and distances between pairs of points, behave well in high-dimensional settings. We identify different sources of misconception that give rise to the above belief. Specifically, we differentiate the hardness…

    Submitted 23 November, 2014; v1 submitted 9 June, 2014; originally announced June 2014.

    Comments: 19 pages, 9 figures, published in AAAI-15: The 29th AAAI Conference on Artificial Intelligence (with author order reversed from ArXiv)