
FedMuon: Federated Learning with Bias-corrected LMO-based Optimization

Authors: Yuki Takezawa, Anastasia Koloskova, Xiaowen Jiang, Sebastian U. Stich

Abstract: Recently, a new optimization method based on the linear minimization oracle (LMO), called Muon, has been attracting increasing attention since it can train neural networks faster than existing adaptive optimization methods, such as Adam. In this paper, we study how Muon can be utilized in federated learning. We first show that straightforwardly using Muon as the local optimizer of FedAvg does not converge to the stationary point since the LMO is a biased operator. We then propose FedMuon which can mitigate this issue. We also analyze how solving the LMO approximately affects the convergence rate and find that, surprisingly, FedMuon can converge for any number of Newton-Schulz iterations, while it can converge faster as we solve the LMO more accurately. Through experiments, we demonstrated that FedMuon can outperform the state-of-the-art federated learning methods.

Submitted 30 September, 2025; originally announced September 2025.
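
As a rough illustration of the Newton-Schulz iterations mentioned in the FedMuon abstract above, the sketch below approximately orthogonalizes a matrix, which is one common way to solve an LMO over a spectral-norm ball. It is a minimal sketch under stated assumptions, not the authors' method: it uses the classical cubic Newton-Schulz update rather than whatever coefficients Muon or FedMuon actually use, and the function names `matmul`, `transpose`, and `newton_schulz` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, num_iters=15):
    """Approximately map g to its orthogonal polar factor.

    Assumption: classical cubic iteration X <- 1.5 X - 0.5 X X^T X,
    started from g scaled by its Frobenius norm so that all singular
    values lie in (0, 1]. More iterations give a more accurate result.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        return [list(row) for row in g]
    x = [[v / fro for v in row] for row in g]
    for _ in range(num_iters):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxtx)]
    return x
```

With few iterations the output is only a coarse approximation of the orthogonal factor; the abstract's claim concerns how this accuracy affects convergence, which the sketch does not attempt to reproduce.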

arXiv:2506.12397

Monotone and nonmonotone linearized block coordinate descent methods for nonsmooth composite optimization problems

Authors: Yassine Nabou, Lahcen El Bourkhissi, Sebastian U. Stich, Tuomo Valkonen

Abstract: In this paper, we introduce both monotone and nonmonotone variants of LiBCoD, a Linearized Block Coordinate Descent method for solving composite optimization problems. At each iteration, a random block is selected, and the smooth components of the objective are linearized along the chosen block in a Gauss-Newton approach. For the monotone variant, we establish a global sublinear convergence rate to a stationary point under the assumption of bounded iterates. For the nonmonotone variant, we derive a global sublinear convergence rate without requiring global Lipschitz continuity or bounded iterates. Preliminary numerical experiments highlight the promising performance of the proposed approach.

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.05791

Exploiting Similarity for Computation and Communication-Efficient Decentralized Optimization

Authors: Yuki Takezawa, Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

Abstract: Reducing communication complexity is critical for efficient decentralized optimization. The proximal decentralized optimization (PDO) framework is particularly appealing, as methods within this framework can exploit functional similarity among nodes to reduce communication rounds. Specifically, when local functions at different nodes are similar, these methods achieve faster convergence with fewer communication steps. However, existing PDO methods often require highly accurate solutions to subproblems associated with the proximal operator, resulting in significant computational overhead. In this work, we propose the Stabilized Proximal Decentralized Optimization (SPDO) method, which achieves state-of-the-art communication and computational complexities within the PDO framework. Additionally, we refine the analysis of existing PDO methods by relaxing subproblem accuracy requirements and leveraging average functional similarity. Experimental results demonstrate that SPDO significantly outperforms existing methods.

Submitted 6 June, 2025; originally announced June 2025.

Comments: ICML 2025

arXiv:2503.08427

Accelerated Distributed Optimization with Compression and Error Feedback

Authors: Yuan Gao, Anton Rodomanov, Jeremy Rack, Sebastian U. Stich

Abstract: Modern machine learning tasks often involve massive datasets and models, necessitating distributed optimization algorithms with reduced communication overhead. Communication compression, where clients transmit compressed updates to a central server, has emerged as a key technique to mitigate communication bottlenecks. However, the theoretical understanding of stochastic distributed optimization with contractive compression remains limited, particularly in conjunction with Nesterov acceleration -- a cornerstone for achieving faster convergence in optimization. In this paper, we propose a novel algorithm, ADEF (Accelerated Distributed Error Feedback), which integrates Nesterov acceleration, contractive compression, error feedback, and gradient difference compression. We prove that ADEF achieves the first accelerated convergence rate for stochastic distributed optimization with contractive compression in the general convex regime. Numerical experiments validate our theoretical findings and demonstrate the practical efficacy of ADEF in reducing communication costs while maintaining fast convergence.

Submitted 29 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.
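
To make the terms "contractive compression" and "error feedback" from the ADEF abstract above concrete, here is a minimal sketch of the classical error-feedback step with a top-k compressor. It is an assumption-laden illustration only: it omits Nesterov acceleration and gradient difference compression, so it is not ADEF itself, and the names `top_k` and `error_feedback_step` are hypothetical.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest.

    Top-k is a standard example of a contractive compressor.
    """
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def error_feedback_step(grad, error, k):
    """One client-side step of classical error feedback.

    The client adds the residual left over from earlier rounds to the
    fresh gradient, compresses the sum, sends the compressed message,
    and stores what was dropped as the new residual.
    """
    corrected = [g + e for g, e in zip(grad, error)]
    message = top_k(corrected, k)
    new_error = [c - m for c, m in zip(corrected, message)]
    return message, new_error
```

A client would call `error_feedback_step` once per round, carrying `new_error` forward, so that coordinates dropped by the compressor are eventually transmitted rather than lost.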

arXiv:2501.15259

Scalable Decentralized Learning with Teleportation

Authors: Yuki Takezawa, Sebastian U. Stich

Abstract: Decentralized SGD can run with low communication costs, but its sparse communication characteristics deteriorate the convergence rate, especially when the number of nodes is large. In decentralized learning settings, communication is assumed to occur on only a given topology, while in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible. Previous studies have tried to alleviate the convergence rate degradation in these cases by designing topologies with large spectral gaps. However, the degradation is still significant when the number of nodes is substantial. In this work, we propose TELEPORTATION. TELEPORTATION activates only a subset of nodes, and the active nodes fetch the parameters from previous active nodes. Then, the active nodes update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes. We show that by activating only a proper number of nodes, TELEPORTATION can completely alleviate the convergence rate degradation. Furthermore, we propose an efficient hyperparameter-tuning method to search for the appropriate number of nodes to be activated. Experimentally, we showed that TELEPORTATION can train neural networks more stably and achieve higher accuracy than Decentralized SGD.

Submitted 27 February, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

Comments: ICLR 2025

arXiv:2501.04443

Revisiting LocalSGD and SCAFFOLD: Improved Rates and Missing Analysis

Authors: Ruichen Luo, Sebastian U Stich, Samuel Horváth, Martin Takáč

Abstract: LocalSGD and SCAFFOLD are widely used methods in distributed stochastic optimization, with numerous applications in machine learning, large-scale data processing, and federated learning. However, rigorously establishing their theoretical advantages over simpler methods, such as minibatch SGD (MbSGD), has proven challenging, as existing analyses often rely on strong assumptions, unrealistic premises, or overly restrictive scenarios. In this work, we revisit the convergence properties of LocalSGD and SCAFFOLD under a variety of existing or weaker conditions, including gradient similarity, Hessian similarity, weak convexity, and Lipschitz continuity of the Hessian. Our analysis shows that (i) LocalSGD achieves faster convergence compared to MbSGD for weakly convex functions without requiring stronger gradient similarity assumptions; (ii) LocalSGD benefits significantly from higher-order similarity and smoothness; and (iii) SCAFFOLD demonstrates faster convergence than MbSGD for a broader class of non-quadratic functions. These theoretical insights provide a clearer understanding of the conditions under which LocalSGD and SCAFFOLD outperform MbSGD.

Submitted 24 February, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

arXiv:2410.10800

Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods

Authors: Daniil Vankov, Anton Rodomanov, Angelia Nedich, Lalitha Sankar, Sebastian U. Stich

Abstract: We study gradient methods for optimizing $(L_0, L_1)$-smooth functions, a class that generalizes Lipschitz-smooth functions and has gained attention for its relevance in machine learning. We provide new insights into the structure of this function class and develop a principled framework for analyzing optimization methods in this setting. While our convergence rate estimates recover existing results for minimizing the gradient norm in nonconvex problems, our approach significantly improves the best-known complexity bounds for convex objectives. Moreover, we show that the gradient method with Polyak stepsizes and the normalized gradient method achieve nearly the same complexity guarantees as methods that rely on explicit knowledge of $(L_0, L_1)$. Finally, we demonstrate that a carefully designed accelerated gradient method can be applied to $(L_0, L_1)$-smooth functions, further improving all previous results.

Submitted 7 March, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

arXiv:2407.07084

Stabilized Proximal-Point Methods for Federated Optimization

Authors: Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

Abstract: In developing efficient optimization algorithms, it is crucial to account for communication constraints -- a significant challenge in modern Federated Learning. The best-known communication complexity among non-accelerated algorithms is achieved by DANE, a distributed proximal-point algorithm that solves local subproblems at each iteration and that can exploit second-order similarity among individual functions. However, to achieve such communication efficiency, the algorithm requires solving local subproblems sufficiently accurately resulting in slightly sub-optimal local complexity. Inspired by the hybrid-projection proximal-point method, in this work, we propose a novel distributed algorithm S-DANE. Compared to DANE, this method uses an auxiliary sequence of prox-centers while maintaining the same deterministic communication complexity. Moreover, the accuracy condition for solving the subproblem is milder, leading to enhanced local computation efficiency. Furthermore, S-DANE supports partial client participation and arbitrary stochastic local solvers, making it attractive in practice. We further accelerate S-DANE and show that the resulting algorithm achieves the best-known communication complexity among all existing methods for distributed convex optimization while still enjoying good local computation efficiency as S-DANE. Finally, we propose adaptive variants of both methods using line search, obtaining the first provably efficient adaptive algorithms that could exploit local second-order similarity without the prior knowledge of any parameters.

Submitted 3 November, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: Adaptive methods are added

arXiv:2405.20114

Towards Faster Decentralized Stochastic Optimization with Communication Compression

Authors: Rustem Islamov, Yuan Gao, Sebastian U. Stich

Abstract: Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of quantized information to their neighbors over a communication graph. Numerous endeavors have been made to address this challenging problem by developing algorithms with compressed communication for decentralized non-convex optimization problems. Despite considerable efforts, the current results suffer from various issues such as non-scalability with the number of clients, requirements for large batches, or bounded gradient assumption. In this paper, we introduce MoTEF, a novel approach that integrates communication compression with Momentum Tracking and Error Feedback. Our analysis demonstrates that MoTEF achieves most of the desired properties, and significantly outperforms existing methods under arbitrary data heterogeneity. We provide numerical experiments to validate our theoretical findings and confirm the practical superiority of MoTEF.

Submitted 25 November, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.11667

The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

Authors: Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2404.08447

Federated Optimization with Doubly Regularized Drift Correction

Authors: Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

Abstract: Federated learning is a distributed optimization paradigm that allows training machine learning models across decentralized devices while keeping the data localized. The standard method, FedAvg, suffers from client drift which can hamper performance and increase communication costs over centralized methods. Previous works proposed various strategies to mitigate drift, yet none have shown uniformly improved communication-computation trade-offs over vanilla gradient descent. In this work, we revisit DANE, an established method in distributed optimization. We show that (i) DANE can achieve the desired communication reduction under Hessian similarity constraints. Furthermore, (ii) we present an extension, DANE+, which supports arbitrary inexact local solvers and has more freedom to choose how to aggregate the local updates. We propose (iii) a novel method, FedRed, which has improved local computational complexity and retains the same communication complexity compared to DANE/DANE+. This is achieved by using doubly regularized drift correction.

Submitted 12 April, 2024; originally announced April 2024.

arXiv:2403.02967

Non-convex Stochastic Composite Optimization with Polyak Momentum

Authors: Yuan Gao, Anton Rodomanov, Sebastian U. Stich

Abstract: The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, it is notoriously known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e. when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of the Polyak momentum in the composite optimization setting and we show the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.

Submitted 8 December, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

arXiv:2402.04843

Spectral Preconditioning for Gradient Methods on Graded Non-convex Functions

Authors: Nikita Doikov, Sebastian U. Stich, Martin Jaggi

Abstract: The performance of optimization methods is often tied to the spectrum of the objective Hessian. Yet, conventional assumptions, such as smoothness, often do not enable us to make fine-grained convergence statements -- particularly not for non-convex problems. Striving for a more intricate characterization of complexity, we introduce a unique concept termed graded non-convexity. This allows us to partition the class of non-convex problems into a nested chain of subclasses. Interestingly, many traditional non-convex objectives, including partially convex problems, matrix factorizations, and neural networks, fall within these subclasses. As a second contribution, we propose gradient methods with spectral preconditioning, which employ inexact top eigenvectors of the Hessian to address the ill-conditioning of the problem, contingent on the grade. Our analysis reveals that these new methods provide provably superior convergence rates compared to basic gradient descent on applicable problem classes, particularly when large gaps exist between the top eigenvalues of the Hessian. Our theory is validated by numerical experiments executed on multiple practical machine learning problems.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2308.06058

Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

Authors: Xiaowen Jiang, Sebastian U. Stich

Abstract: The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al. [2022]), this approach results in slower convergence rates for convex and over-parameterized models. In this work, we make two contributions: Firstly, we propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which guarantee convergence in non-interpolation settings and maintain sub-linear and linear convergence rates for convex and strongly convex functions when training over-parameterized models. AdaSLS requires no knowledge of problem-dependent parameters, and AdaSPS requires only a lower bound of the optimal function value as input. Secondly, we equip AdaSPS and AdaSLS with a novel variance reduction technique and obtain algorithms that require $\smash{\widetilde{\mathcal{O}}}(n+1/ε)$ gradient evaluations to achieve an $\mathcal{O}(ε)$-suboptimality for convex functions, which improves upon the slower $\mathcal{O}(1/ε^2)$ rates of AdaSPS and AdaSLS without variance reduction in the non-interpolation regimes. Moreover, our result matches the fast rates of AdaSVRG but removes the inner-outer-loop structure, which is easier to implement and analyze. Finally, numerical experiments on synthetic and real datasets validate our theory and demonstrate the effectiveness and robustness of our algorithms.

Submitted 21 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.
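
For readers unfamiliar with the Polyak stepsize that the abstract above builds on, the following is a minimal sketch of the classical deterministic rule, where the stepsize is the gap between the current function value and a lower bound on the optimum, divided by the squared gradient norm. It is not SPS, AdaSPS, or AdaSLS; the function name `polyak_step` and the parameter `f_lower_bound` are hypothetical, with the latter mirroring the "lower bound of the optimal function value" that the abstract says AdaSPS takes as input.

```python
def polyak_step(x, f_value, grad, f_lower_bound=0.0):
    """One gradient step with the classical Polyak stepsize.

    stepsize = (f(x) - f_lower_bound) / ||grad f(x)||^2
    Returns x unchanged when the gradient is exactly zero.
    """
    sq_norm = sum(g * g for g in grad)
    if sq_norm == 0.0:
        return list(x)
    stepsize = (f_value - f_lower_bound) / sq_norm
    return [xi - stepsize * gi for xi, gi in zip(x, grad)]
```

For example, with f(x) = 0.5 * ||x||^2 (so the gradient is x and the optimal value is 0), the stepsize is always 0.5 and each call halves the iterate.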

arXiv:2307.06306

Locally Adaptive Federated Learning

Authors: Sohom Mukherjee, Nicolas Loizou, Sebastian U. Stich

Abstract: Federated learning is a paradigm of distributed machine learning in which multiple clients coordinate with a central server to learn a model, without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) ensure balance among the clients by using the same stepsize for local updates on all clients. However, this means that all clients need to respect the global geometry of the function which could yield slow convergence. In this work, we propose locally adaptive federated learning algorithms, that leverage the local geometric information for each client function. We show that such locally adaptive methods with uncoordinated stepsizes across all clients can be particularly efficient in interpolated (overparameterized) settings, and analyze their convergence in the presence of heterogeneous data for convex and strongly convex settings. We validate our theoretical claims by performing illustrative experiments for both i.i.d. and non-i.i.d. cases. Our proposed algorithms match the optimization performance of tuned FedAvg in the convex setting, outperform FedAvg as well as state-of-the-art adaptive federated algorithms like FedAMS for non-convex experiments, and come with superior generalization performance.

Submitted 14 May, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

Comments: 29 pages, 9 figures

arXiv:2306.05100

Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates

Authors: Siqi Zhang, Sayantan Choudhury, Sebastian U Stich, Nicolas Loizou

Abstract: Distributed and federated learning algorithms and techniques are associated primarily with minimization problems. However, with the increase of minimax optimization and variational inequality problems in machine learning, the necessity of designing efficient distributed/federated learning approaches for these problems is becoming more apparent. In this paper, we provide a unified convergence analysis of communication-efficient local training methods for distributed variational inequality problems (VIPs). Our approach is based on a general key assumption on the stochastic estimates that allows us to propose and analyze several novel local training algorithms under a single framework for solving a class of structured non-monotone VIPs. We present the first local gradient descent-accent algorithms with provable improved communication complexity for solving distributed variational inequalities on heterogeneous data. The general algorithmic framework recovers state-of-the-art algorithms and their sharp convergence guarantees when the setting is specialized to minimization or minimax optimization problems. Finally, we demonstrate the strong performance of the proposed algorithms compared to state-of-the-art methods when solving federated minimax optimization problems.

Submitted 2 June, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: ICLR 2024

arXiv:2305.19259

On Convergence of Incremental Gradient for Non-Convex Smooth Functions

Authors: Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi

Abstract: In machine learning and neural network optimization, algorithms like incremental gradient and shuffle SGD are popular because they minimize the number of cache misses and show good practical convergence behavior. However, their optimization properties in theory, especially for non-convex smooth functions, remain incompletely explored. This paper delves into the convergence properties of SGD algorithms with arbitrary data ordering, within a broad framework for non-convex smooth functions. Our findings show enhanced convergence guarantees for incremental gradient and single shuffle SGD. In particular, if $n$ is the training set size, we improve the optimization term of the convergence guarantee for reaching accuracy $\varepsilon$ by a factor of $n$, from $O(n / \varepsilon)$ to $O(1 / \varepsilon)$.

Submitted 12 February, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.01588

Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

Authors: Anastasia Koloskova, Hadrien Hendrikx, Sebastian U. Stich

Abstract: Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value $c >0$. It is widely used for example for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite the popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions. In this paper, we give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$ and show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, (ii) in the stochastic setting convergence to the true optimum cannot be guaranteed under the standard noise assumption, even under arbitrarily small step-sizes. We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD, and illustrate these results with experiments.

Submitted 9 November, 2023; v1 submitted 2 May, 2023; originally announced May 2023.
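For illustration, the clipping step described in the abstract above (limiting the gradient norm to a value $c > 0$ at every iteration) can be sketched as follows. This is a minimal sketch, not the authors' code: the function names `clip_gradient` and `clipped_gd_step` are hypothetical, and it assumes the Euclidean norm and the common rescaling form $\min(1, c/\|g\|)\, g$.

```python
import math


def clip_gradient(grad, c):
    """Return grad rescaled so that its Euclidean norm is at most c (assumes c > 0)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= c:
        # Gradient is already within the threshold: leave it unchanged.
        return list(grad)
    scale = c / norm
    return [scale * g for g in grad]


def clipped_gd_step(x, grad, step_size, c):
    """One gradient step using the clipped gradient (hypothetical helper)."""
    clipped = clip_gradient(grad, c)
    return [xi - step_size * gi for xi, gi in zip(x, clipped)]
```

With stochastic gradients, clipping each sample before averaging changes the expected update direction, which is the bias the abstract refers to.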

arXiv:2301.01313 [pdf, other]

Decentralized Gradient Tracking with Local Steps

Authors: Yue Liu, Tao Lin, Anastasia Koloskova, Sebastian U. Stich

Abstract: Gradient tracking (GT) is an algorithm designed for solving decentralized optimization problems over a network (such as training a machine learning model). A key feature of GT is a tracking mechanism that allows it to overcome data heterogeneity between nodes. We develop a novel decentralized tracking mechanism, $K$-GT, that enables communication-efficient local updates in GT while inheriting the data-independence property of GT. We prove a convergence rate for $K$-GT on smooth non-convex functions and prove that it reduces the communication overhead asymptotically by a linear factor $K$, where $K$ denotes the number of local steps. We illustrate the robustness and effectiveness of this heterogeneity correction on convex and non-convex benchmark problems and on a non-convex neural network training task with the MNIST dataset.

Submitted 3 January, 2023; originally announced January 2023.

arXiv:2206.08307 [pdf, other]

Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning

Authors: Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

Abstract: We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers which have varying computation and communication frequency over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return those to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay $τ_{\max}$ and show that an $ε$-stationary point is reached after $\mathcal{O}\!\left(σ^2ε^{-2}+ τ_{\max}ε^{-1}\right)$ iterations, where $σ$ denotes the variance of stochastic gradients. In this work (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(σ^2ε^{-2}+ \sqrt{τ_{\max}τ_{avg}}ε^{-1}\right)$ without any change in the algorithm where $τ_{avg}$ is the average delay, which can be significantly smaller than $τ_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of $\mathcal{O}\!\left(σ^2ε^{-2}+ τ_{avg}ε^{-1}\right)$, and does not require any extra hyperparameter tuning nor extra communications. Our result shows for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions motivated by federated learning applications and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is only affected by the average delay within each worker.

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2204.06477 [pdf, other]

Data-heterogeneity-aware Mixing for Decentralized Learning

Authors: Yatin Dandi, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich

Abstract: Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs. However, most existing approaches toward decentralized learning disregard the interaction between data heterogeneity and graph topology. In this paper, we characterize the dependence of convergence on the relationship between the mixing weights of the graph and the data heterogeneity across nodes. We propose a metric that quantifies the ability of a graph to mix the current gradients. We further prove that the metric controls the convergence rate, particularly in settings where the heterogeneity across nodes dominates the stochasticity between updates for a given node. Motivated by our analysis, we propose an approach that periodically and efficiently optimizes the metric using standard convex constrained optimization and sketching techniques. Through comprehensive experiments on standard computer vision and NLP benchmarks, we show that our approach leads to improvement in test performance for a wide range of tasks.

Submitted 13 April, 2022; originally announced April 2022.

arXiv:2202.03836 [pdf, other]

An Improved Analysis of Gradient Tracking for Decentralized Machine Learning

Authors: Anastasia Koloskova, Tao Lin, Sebastian U. Stich

Abstract: We consider decentralized machine learning over a network where the training data is distributed across $n$ agents, each of which can compute stochastic model updates on their local data. The agents' common goal is to find a model that minimizes the average of all local loss functions. While gradient tracking (GT) algorithms can overcome a key challenge, namely accounting for differences between workers' local data distributions, the known convergence rates for GT algorithms are not optimal with respect to their dependence on the mixing parameter $p$ (related to the spectral gap of the connectivity matrix). We provide a tighter analysis of the GT method in the stochastic strongly convex, convex and non-convex settings. We improve the dependency on $p$ from $\mathcal{O}(p^{-2})$ to $\mathcal{O}(p^{-1}c^{-1})$ in the noiseless case and from $\mathcal{O}(p^{-3/2})$ to $\mathcal{O}(p^{-1/2}c^{-1})$ in the general stochastic case, where $c \geq p$ is related to the negative eigenvalues of the connectivity matrix (and is a constant in most practical applications). This improvement was possible due to a new proof technique which could be of independent interest.

Submitted 8 February, 2022; originally announced February 2022.

Comments: published at NeurIPS 2021

MSC Class: 68W10; 68W15; 68W40; 90C06; 90C35 ACM Class: G.1.6; F.2.1

Journal ref: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2110.04175 [pdf, other]

RelaySum for Decentralized Deep Learning on Heterogeneous Data

Authors: Thijs Vogels, Lie He, Anastasia Koloskova, Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

Abstract: In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum. We prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data. Our code is available at http://github.com/epfml/relaysgd.

Submitted 31 January, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

Comments: Presented at NeurIPS 2021

Journal ref: Advances in Neural Information Processing Systems 34, 2021

arXiv:2106.08315 [pdf, other]

Decentralized Local Stochastic Extra-Gradient for Variational Inequalities

Authors: Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U Stich, Alexander Gasnikov

Abstract: We consider distributed stochastic variational inequalities (VIs) on unbounded domains with the problem data that is heterogeneous (non-IID) and distributed across many devices. We make a very general assumption on the computational network that, in particular, covers the settings of fully decentralized calculations with time-varying networks and centralized topologies commonly used in Federated Learning. Moreover, multiple local updates on the workers can be made for reducing the communication frequency between the workers. We extend the stochastic extragradient method to this very general setting and theoretically analyze its convergence rate in the strongly-monotone, monotone, and non-monotone (when a Minty solution exists) settings. The provided rates explicitly exhibit the dependence on network characteristics (e.g., mixing time), iteration counter, data heterogeneity, variance, number of devices, and other standard parameters. As a special case, our method and analysis apply to distributed stochastic saddle-point problems (SPP), e.g., to the training of Deep Generative Adversarial Networks (GANs) for which decentralized training has been reported to be extremely challenging. In experiments for the decentralized training of GANs we demonstrate the effectiveness of our proposed approach.

Submitted 2 April, 2023; v1 submitted 15 June, 2021; originally announced June 2021.

Comments: Appears in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Minor modifications with respect to the NeurIPS version. 43 pages, 1 algorithm, 6 figures, 2 tables

Journal ref: https://proceedings.neurips.cc/paper_files/paper/2022/hash/f9379afacdbabfdc6b060972b60f9ab8-Abstract-Conference.html

arXiv:2011.01697 [pdf, other]

A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!

Authors: Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtarik, Sebastian U. Stich

Abstract: Decentralized optimization methods enable on-device training of machine learning models without a central coordinator. In many scenarios communication between devices is energy demanding and time consuming and forms the bottleneck of the entire system. We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators to the communicated messages. By combining our scheme with a new variance reduction technique that progressively throughout the iterations reduces the adverse effect of the injected quantization noise, we obtain the first scheme that converges linearly on strongly convex decentralized problems while using compressed communication only. We prove that our method can solve the problems without any increase in the number of communications compared to the baseline which does not perform any communication compression while still allowing for a significant compression factor which depends on the conditioning of the problem and the topology of the network. Our key theoretical findings are supported by numerical experiments.

Submitted 3 November, 2020; originally announced November 2020.

arXiv:2008.03606 [pdf, other]

Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning

Authors: Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

Abstract: Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon. In fact, obtaining an algorithm for FL which is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, Mime, which i) mitigates client drift and ii) adapts arbitrary centralized optimization algorithms such as momentum and Adam to the cross-device federated learning setting. Mime uses a combination of control-variates and server-level statistics (e.g. momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on iid data. We prove a reduction result showing that Mime can translate the convergence of a generic algorithm in the centralized setting into convergence in the federated setting. Further, we show that when combined with momentum based variance reduction, Mime is provably faster than any centralized method--the first such result. We also perform a thorough experimental exploration of Mime's performance on real world datasets.

Submitted 8 June, 2021; v1 submitted 8 August, 2020; originally announced August 2020.

Comments: Version 2 provides stronger theoretical results and more thorough experiments

MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

arXiv:2008.00051 [pdf, other]

On the Convergence of SGD with Biased Gradients

Authors: Ahmad Ajalloeian, Sebastian U. Stich

Abstract: We analyze the complexity of biased stochastic gradient methods (SGD), where individual updates are corrupted by deterministic, i.e. biased error terms. We derive convergence results for smooth (non-convex) functions and give improved rates under the Polyak-Lojasiewicz condition. We quantify how the magnitude of the bias impacts the attainable accuracy and the convergence rates (sometimes leading to divergence). Our framework covers many applications where either only biased gradient updates are available, or preferred, over unbiased ones for performance reasons. For instance, in the domain of distributed learning, biased gradient compression techniques such as top-k compression have been proposed as a tool to alleviate the communication bottleneck and in derivative-free optimization, only biased gradient estimators can be queried. We discuss a few guiding examples that show the broad applicability of our analysis.

Submitted 9 May, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

Comments: Accepted to ICML 2020 Workshop "Beyond First Order Methods in ML Systems", updated 2021

arXiv:2003.10422 [pdf, other]

A Unified Theory of Decentralized SGD with Changing Topology and Local Updates

Authors: Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, Sebastian U. Stich

Abstract: Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities. Our algorithmic framework covers local SGD updates and synchronous and pairwise gossip updates on adaptive network topology. We derive universal convergence rates for smooth (convex and non-convex) problems and the rates interpolate between the heterogeneous (non-identically distributed data) and iid-data settings, recovering linear convergence rates in many special cases, for instance for over-parametrized models. Our proofs rely on weak assumptions (typically improving over prior work in several aspects) and recover (and improve) the best known complexity results for a host of important scenarios, such as for instance cooperative SGD and federated averaging (local SGD).

Submitted 2 March, 2021; v1 submitted 23 March, 2020; originally announced March 2020.

MSC Class: 68W10; 68W15; 68W40; 90C06; 90C35 ACM Class: G.1.6; F.2.1

arXiv:2002.07839 [pdf, other]

Is Local SGD Better than Minibatch SGD?

Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

Abstract: We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.

Submitted 20 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: 29 pages

arXiv:1910.06378 [pdf, other]

SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

Authors: Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

Abstract: Federated Averaging (FedAvg) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence. As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the 'client-drift' in its local updates. We prove that SCAFFOLD requires significantly fewer communication rounds and is not affected by data heterogeneity or client sampling. Further, we show that (for quadratics) SCAFFOLD can take advantage of similarity in the client's data yielding even faster convergence. The latter is the first result to quantify the usefulness of local-steps in distributed optimization.

Submitted 9 April, 2021; v1 submitted 14 October, 2019; originally announced October 2019.

Comments: v2 contains analysis of FedAvg, non-convex rates of Scaffold, and experimental evaluation. v3 fixes typos, ICML version. v4 slightly improves rate of SCAFFOLD for general convex functions

MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

arXiv:1909.05350 [pdf, ps, other]

The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication

Authors: Sebastian U. Stich, Sai Praneeth Karimireddy

Abstract: We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly-convex quadratic functions only. We further show similar results for SGD with more intricate form of delayed gradients -- compressed gradients under error compensation and for local SGD where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is in particular important for distributed parallel implementations, where asynchronous and communication efficient methods are the key to achieve linear speedups for optimization with multiple devices.

Submitted 16 June, 2021; v1 submitted 11 September, 2019; originally announced September 2019.

Comments: Submitted 9/19, Published 9/20

MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

Journal ref: Journal of Machine Learning Research (JMLR), 21(237):1-36, 2020

arXiv:1907.09356 [pdf, other]

Decentralized Deep Learning with Arbitrary Communication Compression

Authors: Anastasia Koloskova, Tao Lin, Sebastian U. Stich, Martin Jaggi

Abstract: Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD, recently introduced and analyzed for strongly-convex objectives only, converges under arbitrary high compression ratio on general non-convex functions at the rate $O\bigl(1/\sqrt{nT}\bigr)$ where $T$ denotes the number of iterations and $n$ the number of workers. The algorithm achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over distributed user devices, connected by a social network and (ii) in a datacenter (outperforming all-reduce time-wise).

Submitted 11 November, 2020; v1 submitted 22 July, 2019; originally announced July 2019.

MSC Class: 68W10; 68W15; 68W40; 90C06; 90C25; 90C35 ACM Class: G.1.6; F.2.1; E.4

arXiv:1907.04232 [pdf, ps, other]

Unified Optimal Analysis of the (Stochastic) Gradient Method

Authors: Sebastian U. Stich

Abstract: In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $μ$-convex functions under a (milder than standard) $L$-smoothness assumption. We show that for carefully chosen stepsizes SGD converges after $T$ iterations as $O\left( LR^2 \exp \bigl[-\frac{μ}{4L}T\bigr] + \frac{σ^2}{μT} \right)$ where $σ^2$ measures the variance in the stochastic noise. For deterministic gradient descent (GD) and SGD in the interpolation setting we have $σ^2 =0$ and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.

Submitted 23 December, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

Comments: 11 pages, version 2 fixes typos and case distinction in the proof

arXiv:1902.00340 [pdf, other]

Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

Authors: Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

Abstract: We consider decentralized stochastic optimization with the objective function (e.g. data samples for machine learning task) being distributed over $n$ machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by $ω\leq 1$ ($ω=1$ meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T δ^2 ω)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $δ$ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, $\mathcal{O}(1/(nT))$, is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time $\mathcal{O}(1/(δ^2ω) \log (1/ε))$ for accuracy $ε> 0$. This is (to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for $ω> 0$ and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms outperform the respective state-of-the-art baselines and CHOCO-SGD can reduce communication by at least two orders of magnitude.

Submitted 1 February, 2019; originally announced February 2019.

MSC Class: 68W10; 68W15; 68W40; 90C06; 90C25; 90C35 ACM Class: G.1.6; F.2.1; E.4

arXiv:1901.09847 [pdf, other]

Error Feedback Fixes SignSGD and other Gradient Compression Schemes

Authors: Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi

Abstract: Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EF-SGD with arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions. Thus EF-SGD achieves gradient compression for free. Our experiments thoroughly substantiate the theory and show that error-feedback improves both convergence and generalization. Code can be found at https://github.com/epfml/error-feedback-SGD.

Submitted 29 May, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

Comments: ICML 2019 (long talk)

ACM Class: I.2.6; I.5.1
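
The error-feedback idea in the EF-SGD abstract above (add the error left by the compressor back into the next step) can be illustrated with a minimal sketch. It is not the authors' released code: the function names `scaled_sign` and `ef_sgd`, the scaled-sign compressor, the deterministic toy quadratic, and the step size are all assumptions chosen for illustration.

```python
def scaled_sign(p):
    """Illustrative compressor (an assumption): (||p||_1 / d) * sign(p)."""
    scale = sum(abs(v) for v in p) / len(p)
    return [scale * ((v > 0) - (v < 0)) for v in p]


def ef_sgd(grad, x0, lr, steps, compress=scaled_sign):
    """Gradient steps with a compressed update and an error-feedback memory."""
    x = list(x0)
    e = [0.0] * len(x)  # error left over from earlier compressions
    for _ in range(steps):
        g = grad(x)
        # Add the stored error back before compressing.
        p = [lr * gi + ei for gi, ei in zip(g, e)]
        delta = compress(p)  # the only quantity that would be communicated
        x = [xi - di for xi, di in zip(x, delta)]
        # Remember what the compressor dropped, for the next step.
        e = [pi - di for pi, di in zip(p, delta)]
    return x


def quad_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * (x_0^2 + 2 * x_1^2)."""
    return [1.0 * x[0], 2.0 * x[1]]


x_ef = ef_sgd(quad_grad, x0=[5.0, -3.0], lr=0.05, steps=2000)
```

On this toy problem the iterate `x_ef` ends up close to the minimizer at the origin even though every applied update is a scaled sign vector. The toy gradient is exact; a stochastic gradient oracle could be passed as `grad` in the same way.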

arXiv:1810.06999 [pdf, other]

Efficient Greedy Coordinate Descent for Composite Problems

Authors: Sai Praneeth Karimireddy, Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

Abstract: Coordinate descent with random coordinate selection is the current state of the art for many large scale optimization problems. However, greedy selection of the steepest coordinate on smooth problems can yield convergence rates independent of the dimension $n$, and require up to $n$ times fewer iterations. In this paper, we consider greedy updates that are based on subgradients for a class of non-smooth composite problems, which includes $L1$-regularized problems, SVMs and related applications. For these problems we provide (i) the first linear rates of convergence independent of $n$, and show that our greedy update rule provides speedups similar to those obtained in the smooth case. This was previously conjectured to be true for a stronger greedy coordinate selection strategy. Furthermore, we show that (ii) our new selection rule can be mapped to instances of maximum inner product search, allowing us to leverage standard nearest neighbor algorithms to speed up the implementation. We demonstrate the validity of the approach through extensive numerical experiments.

Submitted 16 October, 2018; originally announced October 2018.

Comments: 44 pages, 17 figures, 3 tables

MSC Class: 90C25; 68Q25 ACM Class: G.1.6

arXiv:1806.00413 [pdf, ps, other]

Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients

Authors: Sai Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi

Abstract: We show that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable. This class of problems includes many functions which are not strongly convex, such as logistic regression. Our linear convergence result is (i) affine-invariant, and holds even if an (ii) approximate Hessian is used, and if the subproblems are (iii) only solved approximately. Thus we theoretically demonstrate the superiority of Newton's method over first-order methods, which would only achieve a sublinear $O(1/t^2)$ rate under similar conditions.

Submitted 1 June, 2018; originally announced June 2018.

Comments: 19 pages

MSC Class: 90C25; 68Q25 ACM Class: G.1.6

arXiv:1805.09767 [pdf, other]

Local SGD Converges Fast and Communicates Little

Authors: Sebastian U. Stich

Abstract: Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is local SGD, which runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but has eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced by up to a factor of $T^{1/2}$, where $T$ denotes the total number of steps, compared to mini-batch SGD. This also holds for asynchronous implementations. Local SGD can also be used for large scale training of deep learning models. The results shown here aim to serve as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.

Submitted 3 May, 2019; v1 submitted 24 May, 2018; originally announced May 2018.

Comments: to appear at ICLR 2019, 19 pages

MSC Class: 90C06; 68W40; 68W10 ACM Class: G.1.6; F.2.1
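
The local SGD scheme described in the abstract above (independent SGD runs on each worker, averaged only once in a while) can be sketched on a one-dimensional toy problem. Everything specific here is an assumption for illustration: the names `local_sgd` and `noisy_grad`, the objective $0.5(x-3)^2$, the Gaussian noise level, and the hyperparameters. The workers are simulated sequentially in a single process.

```python
import random


def local_sgd(grad, x0, num_workers, local_steps, rounds, lr, seed=0):
    """Simulate local SGD: each worker runs `local_steps` SGD steps, then all average."""
    rng = random.Random(seed)
    x_avg = x0
    for _ in range(rounds):
        local_models = []
        for _worker in range(num_workers):
            x = x_avg  # every worker starts the round from the shared average
            for _ in range(local_steps):
                x -= lr * grad(x, rng)  # no communication during local steps
            local_models.append(x)
        # One communication round: average the local models.
        x_avg = sum(local_models) / num_workers
    return x_avg


def noisy_grad(x, rng):
    """Stochastic gradient of the toy objective 0.5 * (x - 3)^2 (an assumption)."""
    return (x - 3.0) + rng.gauss(0.0, 0.1)


x_local = local_sgd(noisy_grad, x0=0.0, num_workers=4,
                    local_steps=8, rounds=50, lr=0.1)
```

With these settings the simulation performs 400 gradient steps per worker but only 50 averaging rounds, which is the trade-off the abstract refers to. The result `x_local` lands near the minimizer at 3, up to the noise floor set by the constant step size.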

arXiv:1805.00982 [pdf, other]

k-SVRG: Variance Reduction for Large Scale Optimization

Authors: Anant Raj, Sebastian U. Stich

Abstract: Variance reduced stochastic gradient (SGD) methods converge significantly faster than the vanilla SGD counterpart. However, these methods are not very practical on large scale problems, as they either (i) require frequent passes over the full data to recompute gradients, without making any progress during this time (like for SVRG), or (ii) require additional memory that can surpass the size of the input problem (like for SAGA). In this work, we propose $k$-SVRG, which addresses these issues by making the best use of the \emph{available} memory and minimizing the stalling phases without progress. We prove linear convergence of $k$-SVRG on strongly convex problems and convergence to stationary points on non-convex problems. Numerical experiments show the effectiveness of our method.

Submitted 16 October, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

Comments: The title of the previous version of the manuscript was "SVRG meets SAGA: k-SVRG A Tale of Limited Memory"

MSC Class: 90C06; 68W40; 68W20 ACM Class: G.1.6; F.2.1

arXiv:1803.09539 [pdf, other]

On Matching Pursuit and Coordinate Descent

Authors: Francesco Locatello, Anant Raj, Sai Praneeth Karimireddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U. Stich, Martin Jaggi

Abstract: Two popular examples of first-order optimization methods over linear spaces are coordinate descent and matching pursuit algorithms, with their randomized variants. While the former targets the optimization by moving along coordinates, the latter considers a generalized notion of directions. Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear $\mathcal{O}(1/t)$ rates on smooth objectives and linear convergence on strongly convex objectives. As a byproduct of our affine invariant analysis of matching pursuit, our rates for steepest coordinate descent are the tightest known. Furthermore, we show the first accelerated convergence rate $\mathcal{O}(1/t^2)$ for matching pursuit and steepest coordinate descent on convex objectives.

Submitted 31 May, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

Journal ref: ICML 2018 - Proceedings of the 35th International Conference on Machine Learning

arXiv:1711.02637 [pdf, other]

Safe Adaptive Importance Sampling

Authors: Sebastian U. Stich, Anant Raj, Martin Jaggi

Abstract: Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications. Improved adaptive variants - using importance values defined by the complete gradient information which changes during optimization - enjoy favorable theoretical properties, but are typically computationally infeasible. In this paper we propose an efficient approximation of gradient-based sampling, which is based on safe bounds on the gradient. The proposed sampling distribution is (i) provably the best sampling with respect to the given bounds, (ii) always better than uniform sampling and fixed importance sampling and (iii) can efficiently be computed - in many applications at negligible extra cost. The proposed sampling scheme is generic and can easily be integrated into existing algorithms. In particular, we show that coordinate-descent (CD) and stochastic gradient descent (SGD) can enjoy a significant speed-up under the novel scheme. The proven efficiency of the proposed sampling is verified by extensive numerical testing.

Submitted 7 November, 2017; originally announced November 2017.

Comments: To appear at NIPS 2017

MSC Class: 90C25; 68W20; 68Q25 ACM Class: G.1.6

arXiv:1706.08427 [pdf, other]

Approximate Steepest Coordinate Descent

Authors: Sebastian U. Stich, Anant Raj, Martin Jaggi

Abstract: We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization. The efficiency of this novel scheme is provably better than the efficiency of uniformly random selection, and can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to $n$, the number of coordinates. In many practical applications, our scheme can be implemented at no extra cost and with computational efficiency very close to that of the faster uniform selection. Numerical experiments with Lasso and Ridge regression show promising improvements, in line with our theoretical guarantees.

Submitted 26 June, 2017; originally announced June 2017.

Comments: appearing at ICML 2017

ACM Class: G.1.6
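
For orientation, the plain steepest coordinate descent (SCD) rule that the abstract above uses as its reference point can be sketched on a small quadratic. This is the exact steepest rule, not the approximate selection rule proposed in the paper. The function name `steepest_coordinate_descent`, the objective $f(x) = \tfrac12 x^\top A x - b^\top x$, and the $2 \times 2$ data are assumptions for illustration.

```python
def steepest_coordinate_descent(A, b, x0, steps):
    """Minimize f(x) = 0.5 * x^T A x - b^T x by updating one coordinate per step."""
    n = len(b)
    x = list(x0)
    # Keep the full gradient g = A x - b so the steepest coordinate can be read off.
    g = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    for _ in range(steps):
        # Steepest rule: pick the coordinate with the largest gradient magnitude.
        i = max(range(n), key=lambda k: abs(g[k]))
        step = g[i] / A[i][i]  # exact minimization along coordinate i
        x[i] -= step
        for k in range(n):  # O(n) update of the maintained gradient
            g[k] -= A[k][i] * step
    return x


# Toy symmetric positive definite system (an assumption); A x = b has solution (0.2, 0.6).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_scd = steepest_coordinate_descent(A, b, x0=[0.0, 0.0], steps=100)
```

Scanning all $n$ gradient entries at every step is what makes the exact rule expensive in general, which is the cost that cheaper selection rules try to avoid.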

arXiv:1701.08183 [pdf, ps, other]

On the Existence of Ordinary Triangles

Authors: Radoslav Fulek, Hossein Nassajian Mojarrad, Márton Naszódi, József Solymosi, Sebastian U. Stich, May Szedlák

Abstract: Let $P$ be a finite point set in the plane. A \emph{$c$-ordinary triangle} in $P$ is a subset of $P$ consisting of three non-collinear points such that each of the three lines determined by the three points contains at most $c$ points of $P$. Motivated by a question of Erdős, and answering a question of de Zeeuw, we prove that there exists a constant $c>0$ such that $P$ contains a $c$-ordinary triangle, provided that $P$ is not contained in the union of two lines. Furthermore, the number of $c$-ordinary triangles in $P$ is $Ω(|P|)$.

Submitted 10 June, 2017; v1 submitted 27 January, 2017; originally announced January 2017.

Comments: 5 pages, now an estimate on c is given

MSC Class: 52C30

arXiv:1406.2010 [pdf, other]

On low complexity Acceleration Techniques for Randomized Optimization: Supplementary Online Material

Authors: Sebastian U. Stich

Abstract: Recently it was shown by Nesterov (2011) that techniques from convex optimization can be used to successfully accelerate simple derivative-free randomized optimization methods. The appeal of those schemes lies in their low complexity, which is only $Θ(n)$ per iteration---compared to $Θ(n^2)$ for algorithms storing second-order information or covariance matrices. From a high-level point of view, those accelerated schemes employ correlations between successive iterates---a concept looking similar to the evolution path used in Covariance Matrix Adaptation Evolution Strategies (CMA-ES). In this contribution, we (i) implement and empirically test a simple accelerated random search scheme (SARP). Our study is the first to provide numerical evidence that SARP can effectively be implemented with adaptive step size control and does not require access to gradient or advanced line search oracles. We (ii) try to empirically verify the supposed analogy between the evolution path and SARP. We propose an algorithm CMA-EP that uses only the evolution path to bias the search. This algorithm can be generalized to a family of low memory schemes, with complexity $Θ(mn)$ per iteration, following a recent approach by Loshchilov (2014). The study shows that the performance of CMA-EP heavily depends on the spectra of the objective function and thus it cannot accelerate as consistently as SARP.

Submitted 11 June, 2014; v1 submitted 8 June, 2014; originally announced June 2014.

Comments: 15 pages, 9 figures; the main part without the appendix is a preprint of a conference publication that will appear at PPSN XIII (2014); the appendix reports the complete numerical data that could not fit into the main part

arXiv:1210.5114 [pdf, other]

Variable Metric Random Pursuit

Authors: Sebastian U. Stich, Christian L. Müller, Bernd Gärtner

Abstract: We consider unconstrained randomized optimization of smooth convex objective functions in the gradient-free setting. We analyze Random Pursuit (RP) algorithms with fixed (F-RP) and variable metric (V-RP). The algorithms only use zeroth-order information about the objective function and compute an approximate solution by repeated optimization over randomly chosen one-dimensional subspaces. The distribution of search directions is dictated by the chosen metric. Variable Metric RP uses novel variants of a randomized zeroth-order Hessian approximation scheme recently introduced by Leventhal and Lewis (D. Leventhal and A. S. Lewis, Optimization 60(3), 329--245, 2011). We here present (i) a refined analysis of the expected single step progress of RP algorithms and their global convergence on (strictly) convex functions and (ii) novel convergence bounds for V-RP on strongly convex functions. We also quantify how well the employed metric needs to match the local geometry of the function in order for the RP algorithms to converge with the best possible rate. Our theoretical results are accompanied by numerical experiments, comparing V-RP with the derivative-free schemes CMA-ES, Implicit Filtering, Nelder-Mead, NEWUOA, Pattern-Search and Nesterov's gradient-free algorithms.

Submitted 30 September, 2014; v1 submitted 18 October, 2012; originally announced October 2012.

Comments: 42 pages, 6 figures, 15 tables, submitted to journal, Version 3: majorly revised second part, i.e. Section 5 and Appendix

arXiv:1111.0194 [pdf, other]

Optimization of Convex Functions with Random Pursuit

Authors: Sebastian U. Stich, Christian L. Müller, Bernd Gärtner

Abstract: We consider unconstrained randomized optimization of convex objective functions. We analyze the Random Pursuit algorithm, which iteratively computes an approximate solution to the optimization problem by repeated optimization over a randomly chosen one-dimensional subspace. This randomized method only uses zeroth-order information about the objective function and does not need any problem-specific parametrization. We prove convergence and give convergence rates for smooth objectives assuming that the one-dimensional optimization can be solved exactly or approximately by an oracle. A convenient property of Random Pursuit is its invariance under strictly monotone transformations of the objective function. It thus enjoys identical convergence behavior on a wider function class. To support the theoretical results we present extensive numerical performance results of Random Pursuit, two gradient-free algorithms recently proposed by Nesterov, and a classical adaptive step-size random search scheme. We also present an accelerated heuristic version of the Random Pursuit algorithm which significantly improves standard Random Pursuit on all numerical benchmark problems. A general comparison of the experimental results reveals that (i) standard Random Pursuit is effective on strongly convex functions with moderate condition number, and (ii) the accelerated scheme is comparable to Nesterov's fast gradient method and outperforms adaptive step-size strategies. The appendix contains additional supporting online material.

Submitted 24 May, 2012; v1 submitted 1 November, 2011; originally announced November 2011.

Comments: 35 pages, 5 figures, 8 algorithms, 21 tables, submitted to journal. The appendix contains additional supporting online material, not contained in the journal version
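
The Random Pursuit iteration described in the abstract above (repeated optimization over a randomly chosen one-dimensional subspace, using only function values) can be sketched as follows. The line-search oracle is approximated here by a golden-section search over a fixed bracket. The bracket radius, the function names `golden_section` and `random_pursuit`, the toy objective, and the accept-only-if-better guard are all assumptions for illustration, not details taken from the paper.

```python
import math
import random


def golden_section(phi, lo, hi, iters=60):
    """Approximate minimizer of a unimodal 1-D function phi on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = phi(c), phi(d)
    for _ in range(iters):
        if fc < fd:  # minimizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = phi(c)
        else:  # minimizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = phi(d)
    return (a + b) / 2.0


def random_pursuit(f, x0, steps, radius=20.0, seed=0):
    """Zeroth-order search: line-minimize f along a random unit direction each step."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        # Uniformly random direction on the unit sphere (normalized Gaussian).
        u = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]
        h = golden_section(
            lambda t: f([xi + t * ui for xi, ui in zip(x, u)]), -radius, radius
        )
        candidate = [xi + h * ui for xi, ui in zip(x, u)]
        if f(candidate) < f(x):  # guard against an inexact line search
            x = candidate
    return x


def toy_objective(x):
    """Convex quadratic used only for illustration (an assumption)."""
    return 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)


x_rp = random_pursuit(toy_objective, x0=[3.0, -2.0], steps=300)
```

Only comparisons between function values are used, both in the golden-section search and in the acceptance test. The sketch would therefore behave identically on any strictly increasing transformation of `toy_objective`, which is the invariance property the abstract mentions.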

Showing 1–46 of 46 results for author: Stich, S U