Showing 1–36 of 36 results for author: Stich, S U

Searching in archive stat.
  1. arXiv:2501.15259  [pdf, other]

    cs.LG math.OC stat.ML

    Scalable Decentralized Learning with Teleportation

    Authors: Yuki Takezawa, Sebastian U. Stich

    Abstract: Decentralized SGD can run with low communication costs, but its sparse communication characteristics deteriorate the convergence rate, especially when the number of nodes is large. In decentralized learning settings, communication is assumed to occur on only a given topology, while in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrar…

    Submitted 27 February, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

    Comments: ICLR 2025

  2. arXiv:2405.20114  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Towards Faster Decentralized Stochastic Optimization with Communication Compression

    Authors: Rustem Islamov, Yuan Gao, Sebastian U. Stich

    Abstract: Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of quantized information to their neighbors over a communication graph. Numerous endeavors have been made to address this challengin…

    Submitted 25 November, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  3. arXiv:2405.11667  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

    Authors: Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

    Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under…

    Submitted 19 May, 2024; originally announced May 2024.

  4. arXiv:2308.06058  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

    Authors: Xiaowen Jiang, Sebastian U. Stich

    Abstract: The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize h…

    Submitted 21 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

  5. arXiv:2307.06306  [pdf, other]

    cs.LG math.OC stat.ML

    Locally Adaptive Federated Learning

    Authors: Sohom Mukherjee, Nicolas Loizou, Sebastian U. Stich

    Abstract: Federated learning is a paradigm of distributed machine learning in which multiple clients coordinate with a central server to learn a model, without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) ensure balance among the clients by using the same stepsize for local updates on all clients. However, this means that all clients need to r…

    Submitted 14 May, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: 29 pages, 9 figures

  6. arXiv:2306.05100  [pdf, other]

    math.OC cs.LG stat.ML

    Communication-Efficient Gradient Descent-Ascent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates

    Authors: Siqi Zhang, Sayantan Choudhury, Sebastian U Stich, Nicolas Loizou

    Abstract: Distributed and federated learning algorithms and techniques are associated primarily with minimization problems. However, with the increase of minimax optimization and variational inequality problems in machine learning, the necessity of designing efficient distributed/federated learning approaches for these problems is becoming more apparent. In this paper, we provide a unified convergence analysis…

    Submitted 2 June, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: ICLR 2024

  7. arXiv:2305.19259  [pdf, other]

    cs.LG math.OC stat.ML

    On Convergence of Incremental Gradient for Non-Convex Smooth Functions

    Authors: Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi

    Abstract: In machine learning and neural network optimization, algorithms like incremental gradient and shuffle SGD are popular due to minimizing the number of cache misses and good practical convergence behavior. However, their optimization properties in theory, especially for non-convex smooth functions, remain incompletely explored. This paper delves into the convergence properties of SGD algorithms w…

    Submitted 12 February, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

  8. arXiv:2305.01588  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

    Authors: Anastasia Koloskova, Hadrien Hendrikx, Sebastian U. Stich

    Abstract: Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value $c >0$. It is widely used, for example, for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite the popularity and simplicity of the clipping mechanism, its conve…

    Submitted 9 November, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

  9. arXiv:2204.06477  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Data-heterogeneity-aware Mixing for Decentralized Learning

    Authors: Yatin Dandi, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich

    Abstract: Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs. However, most existing approaches toward decentralized learning disregard the interaction between data heterogeneity and graph topology. In this paper, we characterize the dependence of convergence on the relationship between the mixing weights of the g…

    Submitted 13 April, 2022; originally announced April 2022.

  10. arXiv:2202.09052  [pdf, other]

    cs.LG stat.ML

    Tackling benign nonconvexity with smoothing and stochastic gradients

    Authors: Harsh Vardhan, Sebastian U. Stich

    Abstract: Non-convex optimization problems are ubiquitous in machine learning, especially in Deep Learning. While such complex problems can often be successfully optimized in practice by using stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success. In particular, the standard analyses do not show global convergence of SGD on non-convex functions, and instead show conv…

    Submitted 18 February, 2022; originally announced February 2022.

  11. arXiv:2112.05000  [pdf, other]

    cs.LG stat.ML

    The Peril of Popular Deep Learning Uncertainty Estimation Methods

    Authors: Yehao Liu, Matteo Pagliardini, Tatjana Chavdarova, Sebastian U. Stich

    Abstract: Uncertainty estimation (UE) techniques -- such as the Gaussian process (GP), Bayesian neural networks (BNN), Monte Carlo dropout (MCDropout) -- aim to improve the interpretability of machine learning models by assigning an estimated uncertainty value to each of their prediction outputs. However, since too high uncertainty estimates can have fatal consequences in practice, this paper analyzes the a…

    Submitted 9 December, 2021; originally announced December 2021.

    Comments: Presented at the Bayesian Deep Learning Workshop at NeurIPS 2021

  12. arXiv:2110.04175  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    RelaySum for Decentralized Deep Learning on Heterogeneous Data

    Authors: Thijs Vogels, Lie He, Anastasia Koloskova, Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

    Abstract: In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distr…

    Submitted 31 January, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: Presented at NeurIPS 2021

    Journal ref: Advances in Neural Information Processing Systems 34, 2021

  13. arXiv:2108.07958  [pdf, other]

    stat.ML cs.LG

    Semantic Perturbations with Normalizing Flows for Improved Generalization

    Authors: Oguz Kaan Yuksel, Sebastian U. Stich, Martin Jaggi, Tatjana Chavdarova

    Abstract: Data augmentation is a widely adopted technique for avoiding overfitting when training deep neural networks. However, this approach requires domain-specific knowledge and is often limited to a fixed set of hard-coded transformations. Recently, several works proposed to use generative models for generating semantically meaningful perturbations to train a classifier. However, because accurate encodi…

    Submitted 17 August, 2021; originally announced August 2021.

    Comments: In Proceedings of the IEEE International Conference on Computer Vision

  14. arXiv:2103.02351  [pdf, other]

    cs.LG cs.DC stat.ML

    Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates

    Authors: Sebastian U. Stich, Amirkeivan Mohtashami, Martin Jaggi

    Abstract: It has been experimentally observed that the efficiency of distributed training with stochastic gradient (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness. Especially, it has been observed that the speedup saturates beyond a certain batch size and/or when the delays grow too large. We identify a data-dependent parameter that explains the…

    Submitted 3 March, 2021; originally announced March 2021.

    Comments: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021

  15. arXiv:2009.02388  [pdf, ps, other]

    cs.LG cs.DS stat.ML

    On Communication Compression for Distributed Optimization on Heterogeneous Data

    Authors: Sebastian U. Stich

    Abstract: Lossy gradient compression, with either unbiased or biased compressors, has become a key tool to avoid the communication bottleneck in centrally coordinated distributed training of machine learning models. We analyze the performance of two standard and general types of methods: (i) distributed quantized SGD (D-QSGD) with arbitrary unbiased quantizers and (ii) distributed SGD with error-feedback an…

    Submitted 22 December, 2020; v1 submitted 4 September, 2020; originally announced September 2020.

  16. arXiv:2008.03606  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning

    Authors: Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

    Abstract: Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon. In fact, obtaining an algorithm for FL which is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, Mime, which i) mitigates cl…

    Submitted 8 June, 2021; v1 submitted 8 August, 2020; originally announced August 2020.

    Comments: Version 2 provides stronger theoretical results and more thorough experiments

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

  17. arXiv:2008.00051  [pdf, other]

    cs.LG math.OC stat.ML

    On the Convergence of SGD with Biased Gradients

    Authors: Ahmad Ajalloeian, Sebastian U. Stich

    Abstract: We analyze the complexity of biased stochastic gradient methods (SGD), where individual updates are corrupted by deterministic, i.e. biased error terms. We derive convergence results for smooth (non-convex) functions and give improved rates under the Polyak-Lojasiewicz condition. We quantify how the magnitude of the bias impacts the attainable accuracy and the convergence rates (sometimes leading…

    Submitted 9 May, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

    Comments: Accepted to ICML 2020 Workshop "Beyond First Order Methods in ML Systems", updated 2021

  18. arXiv:2006.14567  [pdf, other]

    stat.ML cs.LG

    Taming GANs with Lookahead-Minmax

    Authors: Tatjana Chavdarova, Matteo Pagliardini, Sebastian U. Stich, Francois Fleuret, Martin Jaggi

    Abstract: Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtrackin…

    Submitted 23 June, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

    Journal ref: ICLR 2021

  19. arXiv:2006.07253  [pdf, other]

    cs.LG stat.ML

    Dynamic Model Pruning with Feedback

    Authors: Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, Martin Jaggi

    Abstract: Deep neural networks often have millions of parameters. This can hinder their deployment to low-end devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporating feedb…

    Submitted 12 June, 2020; originally announced June 2020.

    Comments: appearing at ICLR 2020

  20. arXiv:2006.07242  [pdf, other]

    cs.LG stat.ML

    Ensemble Distillation for Robust Model Fusion in Federated Learning

    Authors: Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi

    Abstract: Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model while keeping the training data decentralized. In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side. However, directly averaging model parameters is only possible if al…

    Submitted 27 March, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

  21. arXiv:2006.05720  [pdf, other]

    cs.LG stat.ML

    Extrapolation for Large-batch Training in Deep Learning

    Authors: Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi

    Abstract: Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization g…

    Submitted 10 June, 2020; originally announced June 2020.

  22. arXiv:2003.10422  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    A Unified Theory of Decentralized SGD with Changing Topology and Local Updates

    Authors: Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, Sebastian U. Stich

    Abstract: Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been…

    Submitted 2 March, 2021; v1 submitted 23 March, 2020; originally announced March 2020.

    MSC Class: 68W10; 68W15; 68W40; 90C06; 90C35 ACM Class: G.1.6; F.2.1

  23. arXiv:2002.07839  [pdf, other]

    cs.LG math.OC stat.ML

    Is Local SGD Better than Minibatch SGD?

    Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

    Abstract: We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibat…

    Submitted 20 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

    Comments: 29 pages

  24. arXiv:1912.04977  [pdf, other]

    cs.LG cs.CR stat.ML

    Advances and Open Problems in Federated Learning

    Authors: Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, et al. (34 additional authors not shown)

    Abstract: Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs re…

    Submitted 8 March, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

    Comments: Published in Foundations and Trends in Machine Learning Vol 4 Issue 1. See: https://www.nowpublishers.com/article/Details/MAL-083

  25. arXiv:1910.06378  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

    Authors: Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

    Abstract: Federated Averaging (FedAvg) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow converge…

    Submitted 9 April, 2021; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: v2 contains analysis of FedAvg, non-convex rates of Scaffold, and experimental evaluation. v3 fixes typos, ICML version. v4 slightly improves rate of SCAFFOLD for general convex functions

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

  26. arXiv:1909.05350  [pdf, ps, other]

    cs.LG cs.DC math.OC stat.ML

    The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication

    Authors: Sebastian U. Stich, Sai Praneeth Karimireddy

    Abstract: We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus,…

    Submitted 16 June, 2021; v1 submitted 11 September, 2019; originally announced September 2019.

    Comments: Submitted 9/19, Published 9/20

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

    Journal ref: Journal of Machine Learning Research (JMLR), 21(237):1-36, 2020

  27. arXiv:1907.09356  [pdf, other]

    cs.LG cs.DC cs.DS math.OC stat.ML

    Decentralized Deep Learning with Arbitrary Communication Compression

    Authors: Anastasia Koloskova, Tao Lin, Sebastian U. Stich, Martin Jaggi

    Abstract: Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD $-$ recently introduced and analyz…

    Submitted 11 November, 2020; v1 submitted 22 July, 2019; originally announced July 2019.

    MSC Class: 68W10; 68W15; 68W40; 90C06; 90C25; 90C35 ACM Class: G.1.6; F.2.1; E.4

  28. arXiv:1907.04232  [pdf, ps, other]

    cs.LG math.NA math.OC stat.ML

    Unified Optimal Analysis of the (Stochastic) Gradient Method

    Authors: Sebastian U. Stich

    Abstract: In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $μ$-convex functions under a (milder than standard) $L$-smoothness assumption. We show that for carefully chosen stepsizes SGD converges after $T$ iterations as $O\left( LR^2 \exp \bigl[-\frac{μ}{4L}T\bigr] + \frac{σ^2}{μT} \right)$ where $σ^2$ measures the variance in the stochastic noise. For determinis…

    Submitted 23 December, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 11 pages, version 2 fixes typos and case distinction in the proof

  29. arXiv:1902.00340  [pdf, other]

    cs.LG cs.DC cs.DS math.OC stat.ML

    Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

    Authors: Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

    Abstract: We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over $n$ machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators…

    Submitted 1 February, 2019; originally announced February 2019.

    MSC Class: 68W10; 68W15; 68W40; 90C06; 90C25; 90C35 ACM Class: G.1.6; F.2.1; E.4

  30. arXiv:1901.09847  [pdf, other]

    cs.LG math.OC stat.ML

    Error Feedback Fixes SignSGD and other Gradient Compression Schemes

    Authors: Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi

    Abstract: Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise bec…

    Submitted 29 May, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

    Comments: ICML 2019 (long talk)

    ACM Class: I.2.6; I.5.1

  31. arXiv:1810.06999  [pdf, other]

    math.OC cs.LG stat.CO stat.ML

    Efficient Greedy Coordinate Descent for Composite Problems

    Authors: Sai Praneeth Karimireddy, Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

    Abstract: Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension $n$, and requiring up to $n$ times fewer iterations. In this paper, we consider greedy updates that are based on subgradients for a class of n…

    Submitted 16 October, 2018; originally announced October 2018.

    Comments: 44 pages, 17 figures, 3 tables

    MSC Class: 90C25; 68Q25 ACM Class: G.1.6

  32. arXiv:1809.07599  [pdf, other]

    cs.LG cs.DC cs.DS stat.ML

    Sparsified SGD with Memory

    Authors: Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi

    Abstract: Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for…

    Submitted 28 November, 2018; v1 submitted 20 September, 2018; originally announced September 2018.

    Comments: to appear at NIPS 2018

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

  33. arXiv:1808.07217  [pdf, other]

    cs.LG stat.ML

    Don't Use Large Mini-Batches, Use Local SGD

    Authors: Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

    Abstract: Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we…

    Submitted 17 February, 2020; v1 submitted 22 August, 2018; originally announced August 2018.

    Comments: To appear in ICLR 2020

  34. arXiv:1806.00413  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients

    Authors: Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

    Abstract: We show that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable. This class of problems includes many functions which are not strongly convex, such as logistic regression. Our linear convergence result is (i) affine-invariant, and holds even if an (ii) approximate Hessian is used, and if the subproblems are (iii) only solved approximately. Thus we…

    Submitted 1 June, 2018; originally announced June 2018.

    Comments: 19 pages

    MSC Class: 90C25; 68Q25 ACM Class: G.1.6

  35. arXiv:1805.00982  [pdf, other]

    math.OC cs.LG stat.ML

    k-SVRG: Variance Reduction for Large Scale Optimization

    Authors: Anant Raj, Sebastian U. Stich

    Abstract: Variance reduced stochastic gradient (SGD) methods converge significantly faster than the vanilla SGD counterpart. However, these methods are not very practical on large scale problems, as they either i) require frequent passes over the full data to recompute gradients, without making any progress during this time (like for SVRG), or ii) they require additional memory that can surpass the size of…

    Submitted 16 October, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

    Comments: The title of the previous version of the manuscript was "SVRG meets SAGA: k-SVRG A Tale of Limited Memory"

    MSC Class: 90C06; 68W40; 68W20 ACM Class: G.1.6; F.2.1

  36. arXiv:1803.09539  [pdf, other]

    stat.ML cs.LG math.OC

    On Matching Pursuit and Coordinate Descent

    Authors: Francesco Locatello, Anant Raj, Sai Praneeth Karimireddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U. Stich, Martin Jaggi

    Abstract: Two popular examples of first-order optimization methods over linear spaces are coordinate descent and matching pursuit algorithms, with their randomized variants. While the former targets the optimization by moving along coordinates, the latter considers a generalized notion of directions. Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affin…

    Submitted 31 May, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

    Journal ref: ICML 2018 - Proceedings of the 35th International Conference on Machine Learning