Showing 1–46 of 46 results for author: Stich, S U

Searching in archive math.
  1. arXiv:2509.26337  [pdf, ps, other]

    cs.LG math.OC

    FedMuon: Federated Learning with Bias-corrected LMO-based Optimization

    Authors: Yuki Takezawa, Anastasia Koloskova, Xiaowen Jiang, Sebastian U. Stich

    Abstract: Recently, a new optimization method based on the linear minimization oracle (LMO), called Muon, has been attracting increasing attention since it can train neural networks faster than existing adaptive optimization methods, such as Adam. In this paper, we study how Muon can be utilized in federated learning. We first show that straightforwardly using Muon as the local optimizer of FedAvg does not…

    Submitted 30 September, 2025; originally announced September 2025.

  2. arXiv:2506.12397  [pdf, ps, other]

    math.OC

    Monotone and nonmonotone linearized block coordinate descent methods for nonsmooth composite optimization problems

    Authors: Yassine Nabou, Lahcen El Bourkhissi, Sebastian U. Stich, Tuomo Valkonen

    Abstract: In this paper, we introduce both monotone and nonmonotone variants of LiBCoD, a Linearized Block Coordinate Descent method for solving composite optimization problems. At each iteration, a random block is selected, and the smooth components of the objective are linearized along the chosen block in a Gauss-Newton approach. For the monotone variant, we establish a…

    Submitted 14 June, 2025; originally announced June 2025.

  3. arXiv:2506.05791  [pdf, ps, other]

    cs.LG math.OC

    Exploiting Similarity for Computation and Communication-Efficient Decentralized Optimization

    Authors: Yuki Takezawa, Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

    Abstract: Reducing communication complexity is critical for efficient decentralized optimization. The proximal decentralized optimization (PDO) framework is particularly appealing, as methods within this framework can exploit functional similarity among nodes to reduce communication rounds. Specifically, when local functions at different nodes are similar, these methods achieve faster convergence with fewer…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: ICML 2025

  4. arXiv:2503.08427  [pdf, other]

    math.OC cs.LG

    Accelerated Distributed Optimization with Compression and Error Feedback

    Authors: Yuan Gao, Anton Rodomanov, Jeremy Rack, Sebastian U. Stich

    Abstract: Modern machine learning tasks often involve massive datasets and models, necessitating distributed optimization algorithms with reduced communication overhead. Communication compression, where clients transmit compressed updates to a central server, has emerged as a key technique to mitigate communication bottlenecks. However, the theoretical understanding of stochastic distributed optimization wi…

    Submitted 29 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  5. arXiv:2501.15259  [pdf, other]

    cs.LG math.OC stat.ML

    Scalable Decentralized Learning with Teleportation

    Authors: Yuki Takezawa, Sebastian U. Stich

    Abstract: Decentralized SGD can run with low communication costs, but its sparse communication characteristics deteriorate the convergence rate, especially when the number of nodes is large. In decentralized learning settings, communication is assumed to occur on only a given topology, while in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrar…

    Submitted 27 February, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

    Comments: ICLR 2025

  6. arXiv:2501.04443  [pdf, ps, other]

    math.OC cs.DC cs.LG

    Revisiting LocalSGD and SCAFFOLD: Improved Rates and Missing Analysis

    Authors: Ruichen Luo, Sebastian U Stich, Samuel Horváth, Martin Takáč

    Abstract: LocalSGD and SCAFFOLD are widely used methods in distributed stochastic optimization, with numerous applications in machine learning, large-scale data processing, and federated learning. However, rigorously establishing their theoretical advantages over simpler methods, such as minibatch SGD (MbSGD), has proven challenging, as existing analyses often rely on strong assumptions, unrealistic premise…

    Submitted 24 February, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

  7. arXiv:2410.10800  [pdf, other]

    math.OC

    Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods

    Authors: Daniil Vankov, Anton Rodomanov, Angelia Nedich, Lalitha Sankar, Sebastian U. Stich

    Abstract: We study gradient methods for optimizing $(L_0, L_1)$-smooth functions, a class that generalizes Lipschitz-smooth functions and has gained attention for its relevance in machine learning. We provide new insights into the structure of this function class and develop a principled framework for analyzing optimization methods in this setting. While our convergence rate estimates recover existing resul…

    Submitted 7 March, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

  8. arXiv:2407.07084  [pdf, other]

    cs.LG math.OC

    Stabilized Proximal-Point Methods for Federated Optimization

    Authors: Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

    Abstract: In developing efficient optimization algorithms, it is crucial to account for communication constraints -- a significant challenge in modern Federated Learning. The best-known communication complexity among non-accelerated algorithms is achieved by DANE, a distributed proximal-point algorithm that solves local subproblems at each iteration and that can exploit second-order similarity among individ…

    Submitted 3 November, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

    Comments: Adaptive methods are added

  9. arXiv:2405.20114  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Towards Faster Decentralized Stochastic Optimization with Communication Compression

    Authors: Rustem Islamov, Yuan Gao, Sebastian U. Stich

    Abstract: Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of quantized information to their neighbors over a communication graph. Numerous endeavors have been made to address this challengin…

    Submitted 25 November, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  10. arXiv:2405.11667  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

    Authors: Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

    Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under…

    Submitted 19 May, 2024; originally announced May 2024.

  11. arXiv:2404.08447  [pdf, other]

    cs.LG math.OC

    Federated Optimization with Doubly Regularized Drift Correction

    Authors: Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

    Abstract: Federated learning is a distributed optimization paradigm that allows training machine learning models across decentralized devices while keeping the data localized. The standard method, FedAvg, suffers from client drift which can hamper performance and increase communication costs over centralized methods. Previous works proposed various strategies to mitigate drift, yet none have shown uniformly…

    Submitted 12 April, 2024; originally announced April 2024.

  12. arXiv:2403.02967  [pdf, other]

    math.OC cs.LG

    Non-convex Stochastic Composite Optimization with Polyak Momentum

    Authors: Yuan Gao, Anton Rodomanov, Sebastian U. Stich

    Abstract: The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, it is well known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e. when only small or bounded batch sizes are used). In this paper, we focus o…

    Submitted 8 December, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

  13. arXiv:2402.04843  [pdf, other]

    math.OC

    Spectral Preconditioning for Gradient Methods on Graded Non-convex Functions

    Authors: Nikita Doikov, Sebastian U. Stich, Martin Jaggi

    Abstract: The performance of optimization methods is often tied to the spectrum of the objective Hessian. Yet, conventional assumptions, such as smoothness, often do not enable us to make finely-grained convergence statements -- particularly not for non-convex problems. Striving for a more intricate characterization of complexity, we introduce a unique concept termed graded non-convexity. This allows to par…

    Submitted 7 February, 2024; originally announced February 2024.

  14. arXiv:2308.06058  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

    Authors: Xiaowen Jiang, Sebastian U. Stich

    Abstract: The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize h…

    Submitted 21 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

  15. arXiv:2307.06306  [pdf, other]

    cs.LG math.OC stat.ML

    Locally Adaptive Federated Learning

    Authors: Sohom Mukherjee, Nicolas Loizou, Sebastian U. Stich

    Abstract: Federated learning is a paradigm of distributed machine learning in which multiple clients coordinate with a central server to learn a model, without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) ensure balance among the clients by using the same stepsize for local updates on all clients. However, this means that all clients need to r…

    Submitted 14 May, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: 29 pages, 9 figures

  16. arXiv:2306.05100  [pdf, other]

    math.OC cs.LG stat.ML

    Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates

    Authors: Siqi Zhang, Sayantan Choudhury, Sebastian U Stich, Nicolas Loizou

    Abstract: Distributed and federated learning algorithms and techniques are associated primarily with minimization problems. However, with the increase of minimax optimization and variational inequality problems in machine learning, the necessity of designing efficient distributed/federated learning approaches for these problems is becoming more apparent. In this paper, we provide a unified convergence analysis…

    Submitted 2 June, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: ICLR 2024

  17. arXiv:2305.19259  [pdf, other]

    cs.LG math.OC stat.ML

    On Convergence of Incremental Gradient for Non-Convex Smooth Functions

    Authors: Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi

    Abstract: In machine learning and neural network optimization, algorithms like incremental gradient and shuffle SGD are popular because they minimize the number of cache misses and show good practical convergence behavior. However, their optimization properties in theory, especially for non-convex smooth functions, remain incompletely explored. This paper delves into the convergence properties of SGD algorithms w…

    Submitted 12 February, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

  18. arXiv:2305.01588  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

    Authors: Anastasia Koloskova, Hadrien Hendrikx, Sebastian U. Stich

    Abstract: Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value $c > 0$. It is widely used, for example, for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite the popularity and simplicity of the clipping mechanism, its conve…

    Submitted 9 November, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

  19. arXiv:2301.01313  [pdf, other]

    math.OC cs.DC cs.LG

    Decentralized Gradient Tracking with Local Steps

    Authors: Yue Liu, Tao Lin, Anastasia Koloskova, Sebastian U. Stich

    Abstract: Gradient tracking (GT) is an algorithm designed for solving decentralized optimization problems over a network (such as training a machine learning model). A key feature of GT is a tracking mechanism that makes it possible to overcome data heterogeneity between nodes. We develop a novel decentralized tracking mechanism, $K$-GT, that enables communication-efficient local updates in GT while inheriting the d…

    Submitted 3 January, 2023; originally announced January 2023.

  20. arXiv:2206.08307  [pdf, other]

    cs.LG cs.DC math.OC

    Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning

    Authors: Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

    Abstract: We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers which have varying computation and communication frequency over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return those to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives…

    Submitted 16 June, 2022; originally announced June 2022.

  21. arXiv:2204.06477  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Data-heterogeneity-aware Mixing for Decentralized Learning

    Authors: Yatin Dandi, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich

    Abstract: Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs. However, most existing approaches toward decentralized learning disregard the interaction between data heterogeneity and graph topology. In this paper, we characterize the dependence of convergence on the relationship between the mixing weights of the g…

    Submitted 13 April, 2022; originally announced April 2022.

  22. arXiv:2202.03836  [pdf, other]

    cs.DC cs.LG math.OC

    An Improved Analysis of Gradient Tracking for Decentralized Machine Learning

    Authors: Anastasia Koloskova, Tao Lin, Sebastian U. Stich

    Abstract: We consider decentralized machine learning over a network where the training data is distributed across $n$ agents, each of which can compute stochastic model updates on their local data. The agents' common goal is to find a model that minimizes the average of all local loss functions. While gradient tracking (GT) algorithms can overcome a key challenge, namely accounting for differences between w…

    Submitted 8 February, 2022; originally announced February 2022.

    Comments: published at NeurIPS 2021

    MSC Class: 68W10; 68W15; 68W40; 90C06; 90C35 ACM Class: G.1.6; F.2.1

    Journal ref: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

  23. arXiv:2110.04175  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    RelaySum for Decentralized Deep Learning on Heterogeneous Data

    Authors: Thijs Vogels, Lie He, Anastasia Koloskova, Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

    Abstract: In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distr…

    Submitted 31 January, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: Presented at NeurIPS 2021

    Journal ref: Advances in Neural Information Processing Systems 34, 2021

  24. arXiv:2106.08315  [pdf, other]

    math.OC cs.DC cs.LG

    Decentralized Local Stochastic Extra-Gradient for Variational Inequalities

    Authors: Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U Stich, Alexander Gasnikov

    Abstract: We consider distributed stochastic variational inequalities (VIs) on unbounded domains with the problem data that is heterogeneous (non-IID) and distributed across many devices. We make a very general assumption on the computational network that, in particular, covers the settings of fully decentralized calculations with time-varying networks and centralized topologies commonly used in Federated L…

    Submitted 2 April, 2023; v1 submitted 15 June, 2021; originally announced June 2021.

    Comments: Appears in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Minor modifications with respect to the NeurIPS version. 43 pages, 1 algorithm, 6 figures, 2 tables

    Journal ref: https://proceedings.neurips.cc/paper_files/paper/2022/hash/f9379afacdbabfdc6b060972b60f9ab8-Abstract-Conference.html

  25. arXiv:2011.01697  [pdf, other]

    math.OC cs.LG

    A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!

    Authors: Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtarik, Sebastian U. Stich

    Abstract: Decentralized optimization methods enable on-device training of machine learning models without a central coordinator. In many scenarios communication between devices is energy demanding and time consuming and forms the bottleneck of the entire system. We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators to the com…

    Submitted 3 November, 2020; originally announced November 2020.

  26. arXiv:2008.03606  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning

    Authors: Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

    Abstract: Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon. In fact, obtaining an algorithm for FL which is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, Mime, which i) mitigates cl…

    Submitted 8 June, 2021; v1 submitted 8 August, 2020; originally announced August 2020.

    Comments: Version 2 provides stronger theoretical results and more thorough experiments

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

  27. arXiv:2008.00051  [pdf, other]

    cs.LG math.OC stat.ML

    On the Convergence of SGD with Biased Gradients

    Authors: Ahmad Ajalloeian, Sebastian U. Stich

    Abstract: We analyze the complexity of biased stochastic gradient methods (SGD), where individual updates are corrupted by deterministic, i.e. biased error terms. We derive convergence results for smooth (non-convex) functions and give improved rates under the Polyak-Lojasiewicz condition. We quantify how the magnitude of the bias impacts the attainable accuracy and the convergence rates (sometimes leading…

    Submitted 9 May, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

    Comments: Accepted to ICML 2020 Workshop "Beyond First Order Methods in ML Systems", updated 2021

  28. arXiv:2003.10422  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    A Unified Theory of Decentralized SGD with Changing Topology and Local Updates

    Authors: Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, Sebastian U. Stich

    Abstract: Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been…

    Submitted 2 March, 2021; v1 submitted 23 March, 2020; originally announced March 2020.

    MSC Class: 68W10; 68W15; 68W40; 90C06; 90C35 ACM Class: G.1.6; F.2.1

  29. arXiv:2002.07839  [pdf, other]

    cs.LG math.OC stat.ML

    Is Local SGD Better than Minibatch SGD?

    Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

    Abstract: We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibat…

    Submitted 20 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

    Comments: 29 pages

  30. arXiv:1910.06378  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

    Authors: Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

    Abstract: Federated Averaging (FedAvg) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow converge…

    Submitted 9 April, 2021; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: v2 contains analysis of FedAvg, non-convex rates of Scaffold, and experimental evaluation. v3 fixes typos, ICML version. v4 slightly improves rate of SCAFFOLD for general convex functions

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

  31. arXiv:1909.05350  [pdf, ps, other]

    cs.LG cs.DC math.OC stat.ML

    The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication

    Authors: Sebastian U. Stich, Sai Praneeth Karimireddy

    Abstract: We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus,…

    Submitted 16 June, 2021; v1 submitted 11 September, 2019; originally announced September 2019.

    Comments: Submitted 9/19, Published 9/20

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

    Journal ref: Journal of Machine Learning Research (JMLR), 21(237):1-36, 2020

  32. arXiv:1907.09356  [pdf, other]

    cs.LG cs.DC cs.DS math.OC stat.ML

    Decentralized Deep Learning with Arbitrary Communication Compression

    Authors: Anastasia Koloskova, Tao Lin, Sebastian U. Stich, Martin Jaggi

    Abstract: Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD $-$ recently introduced and analyz…

    Submitted 11 November, 2020; v1 submitted 22 July, 2019; originally announced July 2019.

    MSC Class: 68W10; 68W15; 68W40; 90C06; 90C25; 90C35 ACM Class: G.1.6; F.2.1; E.4

  33. arXiv:1907.04232  [pdf, ps, other]

    cs.LG math.NA math.OC stat.ML

    Unified Optimal Analysis of the (Stochastic) Gradient Method

    Authors: Sebastian U. Stich

    Abstract: In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $\mu$-convex functions under a (milder than standard) $L$-smoothness assumption. We show that for carefully chosen stepsizes SGD converges after $T$ iterations as $O\left( LR^2 \exp \bigl[-\frac{\mu}{4L}T\bigr] + \frac{\sigma^2}{\mu T} \right)$ where $\sigma^2$ measures the variance in the stochastic noise. For determinis…

    Submitted 23 December, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 11 pages, version 2 fixes typos and case distinction in the proof

  34. arXiv:1902.00340  [pdf, other]

    cs.LG cs.DC cs.DS math.OC stat.ML

    Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

    Authors: Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

    Abstract: We consider decentralized stochastic optimization with the objective function (e.g. data samples for a machine learning task) being distributed over $n$ machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators…

    Submitted 1 February, 2019; originally announced February 2019.

    MSC Class: 68W10; 68W15; 68W40; 90C06; 90C25; 90C35 ACM Class: G.1.6; F.2.1; E.4

  35. arXiv:1901.09847  [pdf, other]

    cs.LG math.OC stat.ML

    Error Feedback Fixes SignSGD and other Gradient Compression Schemes

    Authors: Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi

    Abstract: Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise bec…

    Submitted 29 May, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

    Comments: ICML 2019 (long talk)

    ACM Class: I.2.6; I.5.1

  36. arXiv:1810.06999  [pdf, other]

    math.OC cs.LG stat.CO stat.ML

    Efficient Greedy Coordinate Descent for Composite Problems

    Authors: Sai Praneeth Karimireddy, Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

    Abstract: Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension $n$, requiring up to $n$ times fewer iterations. In this paper, we consider greedy updates that are based on subgradients for a class of n…

    Submitted 16 October, 2018; originally announced October 2018.

    Comments: 44 pages, 17 figures, 3 tables

    MSC Class: 90C25; 68Q25 ACM Class: G.1.6

  37. arXiv:1806.00413  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients

    Authors: Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

    Abstract: We show that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable. This class of problems includes many functions which are not strongly convex, such as logistic regression. Our linear convergence result is (i) affine-invariant, and holds even if an (ii) approximate Hessian is used, and if the subproblems are (iii) only solved approximately. Thus we…

    Submitted 1 June, 2018; originally announced June 2018.

    Comments: 19 pages

    MSC Class: 90C25; 68Q25 ACM Class: G.1.6

  38. arXiv:1805.09767  [pdf, other]

    math.OC cs.DC cs.LG

    Local SGD Converges Fast and Communicates Little

    Authors: Sebastian U. Stich

    Abstract: Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algo…

    Submitted 3 May, 2019; v1 submitted 24 May, 2018; originally announced May 2018.

    Comments: to appear at ICLR 2019, 19 pages

    MSC Class: 90C06; 68W40; 68W10 ACM Class: G.1.6; F.2.1

  39. arXiv:1805.00982  [pdf, other]

    math.OC cs.LG stat.ML

    k-SVRG: Variance Reduction for Large Scale Optimization

    Authors: Anant Raj, Sebastian U. Stich

    Abstract: Variance reduced stochastic gradient (SGD) methods converge significantly faster than the vanilla SGD counterpart. However, these methods are not very practical on large scale problems, as they either i) require frequent passes over the full data to recompute gradients, without making any progress during this time (like for SVRG), or ii) they require additional memory that can surpass the size of…

    Submitted 16 October, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

    Comments: The title of the previous version of the manuscript was "SVRG meets SAGA: k-SVRG A Tale of Limited Memory"

    MSC Class: 90C06; 68W40; 68W20 ACM Class: G.1.6; F.2.1

  40. arXiv:1803.09539  [pdf, other]

    stat.ML cs.LG math.OC

    On Matching Pursuit and Coordinate Descent

    Authors: Francesco Locatello, Anant Raj, Sai Praneeth Karimireddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U. Stich, Martin Jaggi

    Abstract: Two popular examples of first-order optimization methods over linear spaces are coordinate descent and matching pursuit algorithms, with their randomized variants. While the former targets the optimization by moving along coordinates, the latter considers a generalized notion of directions. Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affin…

    Submitted 31 May, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

    Journal ref: ICML 2018 - Proceedings of the 35th International Conference on Machine Learning

  41. arXiv:1711.02637  [pdf, other]

    cs.LG math.OC

    Safe Adaptive Importance Sampling

    Authors: Sebastian U. Stich, Anant Raj, Martin Jaggi

    Abstract: Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications. Improved adaptive variants - using importance values defined by the complete gradient information which changes during optimization - enjoy favorable theoretical properties, but are typically computationally infeasible. In this paper we propose an efficient approximation of gr…

    Submitted 7 November, 2017; originally announced November 2017.

    Comments: To appear at NIPS 2017

    MSC Class: 90C25; 68W20; 68Q25

    ACM Class: G.1.6

  42. arXiv:1706.08427  [pdf, other]

    cs.LG math.OC

    Approximate Steepest Coordinate Descent

    Authors: Sebastian U. Stich, Anant Raj, Martin Jaggi

    Abstract: We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization. The efficiency of this novel scheme is provably better than the efficiency of uniformly random selection, and can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to $n$, the number of coordinates. In many practical application…

    Submitted 26 June, 2017; originally announced June 2017.

    Comments: appearing at ICML 2017

    ACM Class: G.1.6

  43. arXiv:1701.08183  [pdf, ps, other]

    math.CO

    On the Existence of Ordinary Triangles

    Authors: Radoslav Fulek, Hossein Nassajian Mojarrad, Márton Naszódi, József Solymosi, Sebastian U. Stich, May Szedlák

    Abstract: Let $P$ be a finite point set in the plane. A \emph{$c$-ordinary triangle} in $P$ is a subset of $P$ consisting of three non-collinear points such that each of the three lines determined by the three points contains at most $c$ points of $P$. Motivated by a question of Erdős, and answering a question of de Zeeuw, we prove that there exists a constant $c>0$ such that $P$ contains a $c$-ordinary tri…

    Submitted 10 June, 2017; v1 submitted 27 January, 2017; originally announced January 2017.

    Comments: 5 pages, now an estimate on c is given

    MSC Class: 52C30

  44. arXiv:1406.2010  [pdf, other]

    math.OC math.NA

    On low complexity Acceleration Techniques for Randomized Optimization: Supplementary Online Material

    Authors: Sebastian U. Stich

    Abstract: Recently it was shown by Nesterov (2011) that techniques from convex optimization can be used to successfully accelerate simple derivative-free randomized optimization methods. The appeal of those schemes lies in their low complexity, which is only $Θ(n)$ per iteration---compared to $Θ(n^2)$ for algorithms storing second-order information or covariance matrices. From a high-level point of view, th…

    Submitted 11 June, 2014; v1 submitted 8 June, 2014; originally announced June 2014.

    Comments: 15 pages, 9 figures; the main part without the appendix is a preprint of a conference publication that will appear at PPSN XIII (2014); the appendix reports the complete numerical data that could not fit into the main part

  45. arXiv:1210.5114  [pdf, other]

    math.OC

    Variable Metric Random Pursuit

    Authors: Sebastian U. Stich, Christian L. Müller, Bernd Gärtner

    Abstract: We consider unconstrained randomized optimization of smooth convex objective functions in the gradient-free setting. We analyze Random Pursuit (RP) algorithms with fixed (F-RP) and variable metric (V-RP). The algorithms only use zeroth-order information about the objective function and compute an approximate solution by repeated optimization over randomly chosen one-dimensional subspaces. The dist…

    Submitted 30 September, 2014; v1 submitted 18 October, 2012; originally announced October 2012.

    Comments: 42 pages, 6 figures, 15 tables, submitted to journal, Version 3: majorly revised second part, i.e. Section 5 and Appendix

  46. arXiv:1111.0194  [pdf, other]

    math.OC cs.DS math.NA

    Optimization of Convex Functions with Random Pursuit

    Authors: Sebastian U. Stich, Christian L. Müller, Bernd Gärtner

    Abstract: We consider unconstrained randomized optimization of convex objective functions. We analyze the Random Pursuit algorithm, which iteratively computes an approximate solution to the optimization problem by repeated optimization over a randomly chosen one-dimensional subspace. This randomized method only uses zeroth-order information about the objective function and does not need any problem-specific…

    Submitted 24 May, 2012; v1 submitted 1 November, 2011; originally announced November 2011.

    Comments: 35 pages, 5 figures, 8 algorithms, 21 tables, submitted to journal. The appendix contains additional supporting online material, not contained in the journal version