
Showing 1–50 of 84 results for author: Richtárik, P

Searching in archive stat.
  1. arXiv:2505.13416  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)

    Authors: Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, Peter Richtárik

    Abstract: Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a decade of $\sf Adam$'s dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transfera…

    Submitted 19 May, 2025; originally announced May 2025.

  2. arXiv:2502.13482  [pdf, other]

    cs.LG cs.CR cs.DC math.OC stat.ML

    Smoothed Normalization for Efficient Distributed Private Optimization

    Authors: Egor Shulgin, Sarit Khirirat, Peter Richtárik

    Abstract: Federated learning enables training machine learning models while preserving the privacy of participants. Surprisingly, there is no differentially private distributed method for smooth, non-convex optimization problems. The reason is that standard privacy techniques require bounding the participants' contributions, usually enforced via $\textit{clipping}$ of the updates. Existing literature typica…

    Submitted 19 February, 2025; originally announced February 2025.

    Comments: 36 pages

  3. arXiv:2502.12329  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    A Novel Unified Parametric Assumption for Nonconvex Optimization

    Authors: Artem Riabinin, Ahmed Khaled, Peter Richtárik

    Abstract: Nonconvex optimization is central to modern machine learning, but the general framework of nonconvex optimization yields weak convergence guarantees that are too pessimistic compared to practice. On the other hand, while convexity enables efficient optimization, it is of limited applicability to many practical problems. To bridge this gap and better understand the practical success of optimization…

    Submitted 17 February, 2025; originally announced February 2025.

  4. arXiv:2502.11682  [pdf, other]

    cs.LG math.OC stat.ML

    Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy

    Authors: Rustem Islamov, Samuel Horvath, Aurelien Lucchi, Peter Richtarik, Eduard Gorbunov

    Abstract: Strong Differential Privacy (DP) and Optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both properties at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients/bounded data heterogeneity, or they ensure strong optimization performance but lack DP guarantees.…

    Submitted 17 February, 2025; originally announced February 2025.

  5. arXiv:2502.02002  [pdf, other]

    math.OC cs.LG stat.ML

    The Ball-Proximal (="Broximal") Point Method: a New Algorithm, Convergence Theory, and Applications

    Authors: Kaja Gruntkowska, Hanmin Li, Aadi Rane, Peter Richtárik

    Abstract: Non-smooth and non-convex global optimization poses significant challenges across various applications, where standard gradient-based methods often struggle. We propose the Ball-Proximal Point Method, Broximal Point Method, or Ball Point Method (BPM) for short - a novel algorithmic framework inspired by the classical Proximal Point Method (PPM) (Rockafellar, 1976), which, as we show, sheds new lig…

    Submitted 3 February, 2025; originally announced February 2025.

    Comments: 44 pages, 3 figures

  6. arXiv:2502.00775  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning

    Authors: Artavazd Maranjyan, El Mehdi Saad, Peter Richtárik, Francesco Orabona

    Abstract: Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies using more computation than required, especially when computation times vary across devices. If the computation times were known in advance, training could be fast a…

    Submitted 22 May, 2025; v1 submitted 2 February, 2025; originally announced February 2025.

  7. arXiv:2501.16168  [pdf, ps, other]

    cs.LG cs.DC math.OC stat.ML

    Ringmaster ASGD: The First Asynchronous SGD with Optimal Time Complexity

    Authors: Artavazd Maranjyan, Alexander Tyurin, Peter Richtárik

    Abstract: Asynchronous Stochastic Gradient Descent (Asynchronous SGD) is a cornerstone method for parallelizing learning in distributed machine learning. However, its performance suffers under arbitrarily heterogeneous computation times across workers, leading to suboptimal time complexity and inefficiency as the number of workers scales. While several Asynchronous SGD variants have been proposed, recent fi…

    Submitted 3 June, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

  8. arXiv:2412.19916  [pdf, other]

    cs.LG cs.CR math.OC stat.ML

    On the Convergence of DP-SGD with Adaptive Clipping

    Authors: Egor Shulgin, Peter Richtárik

    Abstract: Stochastic Gradient Descent (SGD) with gradient clipping is a powerful technique for enabling differentially private optimization. Although prior works extensively investigated clipping with a constant threshold, private training remains highly sensitive to threshold selection, which can be expensive or even infeasible to tune. This sensitivity motivates the development of adaptive approaches, suc…

    Submitted 27 December, 2024; originally announced December 2024.

  9. arXiv:2412.17082  [pdf, other]

    cs.LG math.OC stat.ML

    MARINA-P: Superior Performance in Non-smooth Federated Optimization with Adaptive Stepsizes

    Authors: Igor Sokolov, Peter Richtárik

    Abstract: Non-smooth communication-efficient federated optimization is crucial for many machine learning applications, yet remains largely unexplored theoretically. Recent advancements have primarily focused on smooth convex and non-convex regimes, leaving a significant gap in understanding the non-smooth convex setting. Additionally, existing literature often overlooks efficient server-to-worker communicat…

    Submitted 22 December, 2024; originally announced December 2024.

    Comments: 49 Pages, 5 Algorithms, 4 Theorems, 6 Lemmas, 8 Figures

  10. arXiv:2412.17054  [pdf, ps, other]

    math.OC cs.CR cs.LG stat.ML

    Differentially Private Random Block Coordinate Descent

    Authors: Artavazd Maranjyan, Abdurakhmon Sadiev, Peter Richtárik

    Abstract: Coordinate Descent (CD) methods have gained significant attention in machine learning due to their effectiveness in solving high-dimensional problems and their ability to decompose complex optimization tasks. However, classical CD methods were neither designed nor analyzed with data privacy in mind, a critical concern when handling sensitive information. This has led to the development of differen…

    Submitted 22 December, 2024; originally announced December 2024.

  11. arXiv:2410.15368  [pdf, other]

    math.OC cs.LG stat.ML

    Tighter Performance Theory of FedExProx

    Authors: Wojciech Anyszka, Kaja Gruntkowska, Alexander Tyurin, Peter Richtárik

    Abstract: We revisit FedExProx - a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop…

    Submitted 20 October, 2024; originally announced October 2024.

    Comments: 43 pages, 4 figures

  12. arXiv:2410.04285  [pdf, other]

    math.OC cs.DC cs.LG stat.ML

    MindFlayer: Efficient Asynchronous Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times

    Authors: Artavazd Maranjyan, Omar Shaikh Omar, Peter Richtárik

    Abstract: We study the problem of minimizing the expectation of smooth nonconvex functions with the help of several parallel workers whose role is to compute stochastic gradients. In particular, we focus on the challenging situation where the workers' compute times are arbitrarily heterogeneous and random. In the simpler regime characterized by arbitrarily heterogeneous but deterministic compute times, Tyur…

    Submitted 5 October, 2024; originally announced October 2024.

  13. arXiv:2405.15545  [pdf, other]

    math.OC cs.LG stat.ML

    Freya PAGE: First Optimal Time Complexity for Large-Scale Nonconvex Finite-Sum Optimization with Heterogeneous Asynchronous Computations

    Authors: Alexander Tyurin, Kaja Gruntkowska, Peter Richtárik

    Abstract: In practical distributed systems, workers are typically not homogeneous, and due to differences in hardware configurations and network conditions, can have highly varying processing times. We consider smooth nonconvex finite-sum (empirical risk minimization) problems in this setup and introduce a new parallel method, Freya PAGE, designed to handle arbitrarily heterogeneous and asynchronous computa…

    Submitted 2 November, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: 43 pages, 2 figures

  14. arXiv:2402.10774  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants

    Authors: Peter Richtárik, Elnur Gasanov, Konstantin Burlachenko

    Abstract: Error Feedback (EF) is a highly popular and immensely effective mechanism for fixing convergence issues which arise in distributed training methods (such as distributed GD or SGD) when these are enhanced with greedy communication compression techniques such as TopK. While EF was proposed almost a decade ago (Seide et al., 2014), and despite concentrated effort by the community to advance the theor…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: 70 pages, 14 figures, 6 tables

    MSC Class: 90C26; 74Pxx ACM Class: G.1.6; I.2.11; I.2.m

  15. arXiv:2402.06412  [pdf, other]

    math.OC cs.LG stat.ML

    Improving the Worst-Case Bidirectional Communication Complexity for Nonconvex Distributed Optimization under Function Similarity

    Authors: Kaja Gruntkowska, Alexander Tyurin, Peter Richtárik

    Abstract: Effective communication between the server and workers plays a key role in distributed optimization. In this paper, we focus on optimizing the server-to-worker communication, uncovering inefficiencies in prevalent downlink compression approaches. Considering first the pure setup where the uplink communication costs are negligible, we introduce MARINA-P, a novel method for downlink compression, emp…

    Submitted 2 November, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

  16. arXiv:2305.18929  [pdf, other]

    cs.LG math.OC stat.ML

    Clip21: Error Feedback for Gradient Clipping

    Authors: Sarit Khirirat, Eduard Gorbunov, Samuel Horváth, Rustem Islamov, Fakhri Karray, Peter Richtárik

    Abstract: Motivated by the increasing popularity and importance of large-scale training under differential privacy (DP) constraints, we study distributed gradient methods with gradient clipping, i.e., clipping applied to the gradients computed from local information at the nodes. While gradient clipping is an essential tool for injecting formal DP guarantees into gradient-based methods [1], it also induces…

    Submitted 30 May, 2023; originally announced May 2023.

  17. arXiv:2305.18627  [pdf, other]

    cs.LG cs.DC stat.ML

    Global-QSGD: Practical Floatless Quantization for Distributed Learning with Theoretical Guarantees

    Authors: Jihao Xin, Marco Canini, Peter Richtárik, Samuel Horváth

    Abstract: Efficient distributed training is a principal driver of recent advances in deep learning. However, communication often proves costly and becomes the primary bottleneck in these systems. As a result, there is a demand for the design of efficient communication mechanisms that can empirically boost throughput while providing theoretical guarantees. In this work, we introduce Global-QSGD, a novel fami…

    Submitted 29 May, 2023; originally announced May 2023.

  18. arXiv:2305.15264  [pdf, other]

    math.OC cs.DC cs.LG stat.ML

    Error Feedback Shines when Features are Rare

    Authors: Peter Richtárik, Elnur Gasanov, Konstantin Burlachenko

    Abstract: We provide the first proof that gradient descent $\left({\color{green}\sf GD}\right)$ with greedy sparsification $\left({\color{green}\sf TopK}\right)$ and error feedback $\left({\color{green}\sf EF}\right)$ can obtain better communication complexity than vanilla ${\color{green}\sf GD}$ when solving the distributed optimization problem…

    Submitted 24 May, 2023; originally announced May 2023.

  19. arXiv:2303.04622  [pdf, ps, other]

    stat.ML cs.LG math.OC

    ELF: Federated Langevin Algorithms with Primal, Dual and Bidirectional Compression

    Authors: Avetik Karagulyan, Peter Richtárik

    Abstract: Federated sampling algorithms have recently gained great popularity in the community of machine learning and statistics. This paper studies variants of such algorithms called Error Feedback Langevin algorithms (ELF). In particular, we analyze the combinations of EF21 and EF21-P with the federated Langevin Monte-Carlo. We propose three algorithms: P-ELF, D-ELF, and B-ELF that use, respectively, pri…

    Submitted 8 March, 2023; originally announced March 2023.

  20. arXiv:2210.16402  [pdf, ps, other]

    cs.LG cs.DC math.OC stat.ML

    GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity

    Authors: Artavazd Maranjyan, Mher Safaryan, Peter Richtárik

    Abstract: We study a class of distributed optimization algorithms that aim to alleviate high communication costs by allowing clients to perform multiple local gradient-type training steps before communication. In a recent breakthrough, Mishchenko et al. (2022) proved that local training, when properly executed, leads to provable communication acceleration, and this holds in the strongly convex regime withou…

    Submitted 9 June, 2025; v1 submitted 28 October, 2022; originally announced October 2022.

  21. arXiv:2206.02275  [pdf, other]

    cs.LG math.OC stat.ML

    Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling

    Authors: Alexander Tyurin, Lukang Sun, Konstantin Burlachenko, Peter Richtárik

    Abstract: We revisit the classical problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions. The optimal complexity of stochastic first-order methods in terms of the number of gradient evaluations of individual functions is $\mathcal{O}\left(n + n^{1/2}\varepsilon^{-1}\right)$, attained by the optimal SGD methods $\small\sf\color{green}{SPIDER}$(arXi…

    Submitted 5 June, 2022; originally announced June 2022.

    Comments: 25 pages, 6 figures

    MSC Class: 90C26; 65K05 ACM Class: F.2.1; I.2.6

  22. arXiv:2204.13169  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    FedShuffle: Recipes for Better Use of Local Work in Federated Learning

    Authors: Samuel Horváth, Maziar Sanjabi, Lin Xiao, Peter Richtárik, Michael Rabbat

    Abstract: The practice of applying several local updates before aggregation across clients has been empirically shown to be a successful approach to overcoming the communication bottleneck in Federated Learning (FL). Such methods are usually implemented by having clients perform one or more epochs of local training per round while randomly reshuffling their finite dataset in each epoch. Data imbalance, wher…

    Submitted 27 September, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

    Comments: Published in Transactions on Machine Learning Research (09/2022)

  23. arXiv:2201.13320  [pdf, other]

    cs.LG cs.DC cs.DS math.OC stat.ML

    BEER: Fast $O(1/T)$ Rate for Decentralized Nonconvex Optimization with Communication Compression

    Authors: Haoyu Zhao, Boyue Li, Zhize Li, Peter Richtárik, Yuejie Chi

    Abstract: Communication efficiency has been widely recognized as the bottleneck for large-scale decentralized machine learning applications in multi-agent or federated environments. To tackle the communication bottleneck, there have been many efforts to design communication-compressed algorithms for decentralized nonconvex optimization, where the clients are only allowed to communicate a small amount of qua…

    Submitted 13 October, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

    Comments: NeurIPS 2022

  24. arXiv:2111.11556  [pdf, other]

    cs.LG math.OC stat.ML

    FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning

    Authors: Elnur Gasanov, Ahmed Khaled, Samuel Horváth, Peter Richtárik

    Abstract: Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling sev…

    Submitted 23 February, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

    Comments: V2: includes non-convex analysis as well as new large-scale experiments with neural networks. To appear in AISTATS 2022

  25. arXiv:2110.03313  [pdf, other]

    cs.LG stat.ML

    Distributed Methods with Compressed Communication for Solving Variational Inequalities, with Theoretical Guarantees

    Authors: Aleksandr Beznosikov, Peter Richtárik, Michael Diskin, Max Ryabinin, Alexander Gasnikov

    Abstract: Variational inequalities in general and saddle point problems in particular are increasingly relevant in machine learning applications, including adversarial learning, GANs, transport and robust optimization. With increasing data and problem sizes necessary to train high performing models across various applications, we need to rely on parallel and distributed computing. However, in distributed tr…

    Submitted 2 April, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: Appears in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Minor modifications with respect to the NeurIPS version. 73 pages, 9 algorithms, 2 figures, 2 tables

    Journal ref: https://proceedings.neurips.cc/paper_files/paper/2022/hash/5ac1428c23b5da5e66d029646ea3206d-Abstract-Conference.html

  26. arXiv:2110.03300  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Permutation Compressors for Provably Faster Distributed Nonconvex Optimization

    Authors: Rafał Szlendak, Alexander Tyurin, Peter Richtárik

    Abstract: We study the MARINA method of Gorbunov et al (2021) -- the current state-of-the-art distributed non-convex optimization method in terms of theoretical communication complexity. Theoretical superiority of this method can be largely attributed to two sources: the use of a carefully engineered biased stochastic gradient estimator, which leads to a reduction in the number of communication rounds, and…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: 53 pages

  27. arXiv:2106.05203  [pdf, other]

    cs.LG math.OC stat.ML

    EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback

    Authors: Peter Richtárik, Igor Sokolov, Ilyas Fatkhullin

    Abstract: Error feedback (EF), also known as error compensation, is an immensely popular convergence stabilization mechanism in the context of distributed training of supervised machine learning models enhanced by the use of contractive communication compression mechanisms, such as Top-$k$. First proposed by Seide et al (2014) as a heuristic, EF resisted any theoretical understanding until recently [Stich e…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: 37 pages, 5 algorithms, 3 Theorems, 8 Lemmas, 15 Figures

  28. arXiv:2102.12810  [pdf, other]

    cs.LG stat.ML

    Hyperparameter Transfer Learning with Adaptive Complexity

    Authors: Samuel Horváth, Aaron Klein, Peter Richtárik, Cédric Archambeau

    Abstract: Bayesian optimization (BO) is a sample efficient approach to automatically tune the hyperparameters of machine learning models. In practice, one frequently has to solve similar hyperparameter tuning problems sequentially. For example, one might have to tune a type of neural network learned across a series of different classification problems. Recent work on multi-task BO exploits knowledge gained…

    Submitted 25 February, 2021; originally announced February 2021.

    Comments: 12 pages, Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA

  29. arXiv:2102.08374  [pdf, other]

    cs.LG math.OC stat.ML

    IntSGD: Adaptive Floatless Compression of Stochastic Gradients

    Authors: Konstantin Mishchenko, Bokun Wang, Dmitry Kovalev, Peter Richtárik

    Abstract: We propose a family of adaptive integer compression operators for distributed Stochastic Gradient Descent (SGD) that do not communicate a single float. This is achieved by multiplying floating-point vectors with a number known to every device and then rounding to integers. In contrast to the prior work on integer compression for SwitchML by Sapio et al. (2021), our IntSGD method is provably conver…

    Submitted 20 March, 2022; v1 submitted 16 February, 2021; originally announced February 2021.

    Comments: Spotlight at ICLR 2022. 27 pages, 6 figures, 3 algorithms

    Journal ref: International Conference on Learning Representations (2022)

  30. arXiv:2010.00892  [pdf, other]

    cs.LG math.OC stat.ML

    Variance-Reduced Methods for Machine Learning

    Authors: Robert M. Gower, Mark Schmidt, Francis Bach, Peter Richtarik

    Abstract: Stochastic optimization lies at the heart of machine learning, and its cornerstone is stochastic gradient descent (SGD), a method introduced over 60 years ago. The last 8 years have seen an exciting new development: variance reduction (VR) for stochastic optimization methods. These VR methods excel in settings where more than one pass through the training data is allowed, achieving a faster conver…

    Submitted 2 October, 2020; originally announced October 2020.

    Comments: 16 pages, 7 figures, 1 table

    MSC Class: 65K05; 68T99 ACM Class: G.1.6

  31. arXiv:2008.10898  [pdf, other]

    cs.LG cs.AI cs.DS math.OC stat.ML

    PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization

    Authors: Zhize Li, Hongyan Bao, Xiangliang Zhang, Peter Richtárik

    Abstract: In this paper, we propose a novel stochastic gradient estimator -- ProbAbilistic Gradient Estimator (PAGE) -- for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$ or reuses the previous gradient with a small adjustment, at a much lower computational cost, w…

    Submitted 11 June, 2021; v1 submitted 25 August, 2020; originally announced August 2020.

    Comments: 25 pages; accepted by ICML 2021 (long talk)

  32. arXiv:2006.11773  [pdf, other]

    math.OC stat.ML

    Optimal and Practical Algorithms for Smooth and Strongly Convex Decentralized Optimization

    Authors: Dmitry Kovalev, Adil Salim, Peter Richtárik

    Abstract: We consider the task of decentralized minimization of the sum of smooth strongly convex functions stored across the nodes of a network. For this problem, lower bounds on the number of gradient computations and the number of communication rounds required to achieve $\varepsilon$ accuracy have recently been proven. We propose two new algorithms for this decentralized optimization problem and equip t…

    Submitted 13 November, 2020; v1 submitted 21 June, 2020; originally announced June 2020.

  33. arXiv:2006.11573  [pdf, other]

    cs.LG math.OC stat.ML

    Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

    Authors: Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M. Gower, Peter Richtárik

    Abstract: We present a unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer. We do this by extending the unified analysis of Gorbunov, Hanzely & Richtárik (2020) and dropping the requirement that the loss function be strongly convex. Instead, we only rely on convexity of the loss function. Our unified analysis appli…

    Submitted 20 June, 2020; originally announced June 2020.

  34. arXiv:2006.11077  [pdf, other]

    cs.LG stat.ML

    A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning

    Authors: Samuel Horváth, Peter Richtárik

    Abstract: Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed com…

    Submitted 14 March, 2021; v1 submitted 19 June, 2020; originally announced June 2020.

    Comments: 10 pages, 7 figures, published as a conference paper at ICLR 2021

  35. arXiv:2006.09270  [pdf, other]

    stat.ML cs.LG math.OC

    Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm

    Authors: Adil Salim, Peter Richtárik

    Abstract: We consider the task of sampling with respect to a log concave probability distribution. The potential of the target distribution is assumed to be composite, \textit{i.e.}, written as the sum of a smooth convex term, and a nonsmooth convex term possibly taking infinite values. The target distribution can be seen as a minimizer of the Kullback-Leibler divergence defined on the Wasserstein space (\t…

    Submitted 22 February, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

  36. arXiv:2006.07013  [pdf, ps, other]

    math.OC cs.DS cs.LG stat.ML

    A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization

    Authors: Zhize Li, Peter Richtárik

    Abstract: In this paper, we study the performance of a large family of SGD variants in the smooth nonconvex regime. To this end, we propose a generic and flexible assumption capable of accurate modeling of the second moment of the stochastic gradient. Our assumption is satisfied by a large number of specific variants of SGD in the literature, including SGD with arbitrary sampling, SGD with compressed gradie…

    Submitted 12 June, 2020; originally announced June 2020.

    Comments: 77 pages

  37. arXiv:2006.05988  [pdf, other]

    math.OC cs.LG stat.ML

    Random Reshuffling: Simple Analysis with Vast Improvements

    Authors: Konstantin Mishchenko, Ahmed Khaled, Peter Richtárik

    Abstract: Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling. Often contrasted with its sibling Stochastic Gradient Descent (SGD), RR is usually faster in practice and enjoys significant popularity in convex and non-convex optimization. The convergence rate of RR has attracted substantial attention r…

    Submitted 5 April, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

    Comments: v3 updates: Theorem 4 includes a new result for Polyak-Lojasiewicz functions. NeurIPS 2020. 35 pages, 2 figures, 2 tables, 3 algorithms

  38. arXiv:2005.01097  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptive Learning of the Optimal Batch Size of SGD

    Authors: Motasem Alfarra, Slavomir Hanzely, Alyazeed Albasyoni, Bernard Ghanem, Peter Richtarik

    Abstract: Recent advances in the theoretical understanding of SGD led to a formula for the optimal batch size minimizing the number of effective data passes, i.e., the number of iterations times the batch size. However, this formula is of no practical value as it depends on the knowledge of the variance of the stochastic gradients evaluated at the optimum. In this paper we design a practical SGD method capa…

    Submitted 19 November, 2021; v1 submitted 3 May, 2020; originally announced May 2020.

    Comments: Accepted to the 12th Annual Workshop on Optimization for Machine Learning (OPT2020)

  39. arXiv:2004.02635  [pdf, other]

    math.OC cs.LG stat.ML

    Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms

    Authors: Adil Salim, Laurent Condat, Konstantin Mishchenko, Peter Richtárik

    Abstract: We consider minimizing the sum of three convex functions, where the first one F is smooth, the second one is nonsmooth and proximable and the third one is the composition of a nonsmooth proximable function with a linear operator L. This template problem has many applications, for instance, in image processing and machine learning. First, we propose a new primal-dual algorithm, which we call PDDY,…

    Submitted 26 July, 2022; v1 submitted 3 April, 2020; originally announced April 2020.

  40. arXiv:2004.01442  [pdf, other]

    cs.LG math.OC stat.ML

    From Local SGD to Local Fixed-Point Methods for Federated Learning

    Authors: Grigory Malinovsky, Dmitry Kovalev, Elnur Gasanov, Laurent Condat, Peter Richtárik

    Abstract: Most algorithms for solving optimization problems or finding saddle points of convex-concave functions are fixed-point algorithms. In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. Our work is motivated by the needs of federated learning. In this context, each local operator models the computatio…

    Submitted 16 June, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

    Comments: Accepted to ICML 2020

  41. arXiv:2002.12410  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    On Biased Compression for Distributed Learning

    Authors: Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, Mher Safaryan

    Abstract: In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study…

    Submitted 14 January, 2024; v1 submitted 27 February, 2020; originally announced February 2020.

    Comments: 50 pages, 9 figures, 5 tables, 22 theorems and lemmas, 7 new compression operators, 1 algorithm

    Journal ref: Journal of Machine Learning Research 2023: https://www.jmlr.org/papers/v24/21-1548.html

  42. arXiv:2002.08958  [pdf, other]

    cs.LG cs.DC cs.IT math.OC stat.ML

    Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor

    Authors: Mher Safaryan, Egor Shulgin, Peter Richtárik

    Abstract: In order to mitigate the high communication cost in distributed and federated learning, various vector compression schemes, such as quantization, sparsification and dithering, have become very popular. In designing a compression method, one aims to communicate as few bits as possible, which minimizes the cost per communication round, while at the same time attempting to impart as little distortion…

    Submitted 26 January, 2021; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: 23 pages, 6 figures, 2 tables

    Journal ref: Information and Inference: A Journal of the IMA, 2021

  43. arXiv:2002.05516  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Federated Learning of a Mixture of Global and Local Models

    Authors: Filip Hanzely, Peter Richtárik

    Abstract: We propose a new optimization formulation for training federated learning models. The standard formulation has the form of an empirical risk minimization problem constructed to find a single global model trained from the private data stored across all participating devices. In contrast, our formulation seeks an explicit trade-off between this traditional global model and the local models, which ca…

    Submitted 12 February, 2021; v1 submitted 10 February, 2020; originally announced February 2020.

    Comments: 40 pages, 8 algorithms, 6 figures, 1 table (minor changes compared to the previous versions)

  44. arXiv:2002.05359  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization

    Authors: Samuel Horváth, Lihua Lei, Peter Richtárik, Michael I. Jordan

    Abstract: Adaptivity is an important yet under-studied property in modern optimization theory. The gap between the state-of-the-art theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as step-size schemes and batch sizes, in different regimes. Despite the appealing theoretical results,…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: 11 pages, 4 figures, 20 pages appendix

  45. arXiv:2002.03329  [pdf, other]

    math.OC cs.LG stat.ML

    Better Theory for SGD in the Nonconvex World

    Authors: Ahmed Khaled, Peter Richtárik

    Abstract: Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic grad…

    Submitted 24 July, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

    Comments: 33 pages, 3 figures, 4 theorems, and 4 propositions. V3 updates: added several references on error conditions (Tseng, Solodov, Bottou and Tsitsiklis, Grimmer), added a full proof of Corollary 1, cleaned up several proofs, and made minor adjustments to text for clarity

  46. arXiv:1912.01597  [pdf, other]

    cs.LG math.OC stat.ML

    Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates

    Authors: Dmitry Kovalev, Konstantin Mishchenko, Peter Richtárik

    Abstract: We present two new remarkably simple stochastic second-order methods for minimizing the average of a very large number of sufficiently smooth and strongly convex functions. The first is a stochastic variant of Newton's method (SN), and the second is a stochastic variant of cubically regularized Newton's method (SCN). We establish local linear-quadratic convergence results. Unlike existing stochast…

    Submitted 3 December, 2019; originally announced December 2019.

    Comments: 16 pages, 2 figures, 3 algorithms, 2 theorems, 7 lemmas; to be presented at the NeurIPS workshop "Beyond First Order Methods in ML"

  47. arXiv:1909.04746  [pdf, other]

    cs.LG cs.DC math.NA math.OC stat.ML

    Tighter Theory for Local SGD on Identical and Heterogeneous Data

    Authors: Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

    Abstract: We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory and provide values of the optimal stepsize and optimal number of local iterations. Our bounds are based on a new notion of variance that is specific to local SGD methods with different data. T…

    Submitted 14 April, 2022; v1 submitted 10 September, 2019; originally announced September 2019.

    Comments: AISTATS 2020. 31 pages, 1 algorithm, 5 theorems, 6 figures

  48. arXiv:1909.04716  [pdf, other]

    cs.LG cs.DC math.NA math.OC stat.ML

    Gradient Descent with Compressed Iterates

    Authors: Ahmed Khaled, Peter Richtárik

    Abstract: We propose and analyze a new type of stochastic first order method: gradient descent with compressed iterates (GDCI). GDCI in each iteration first compresses the current iterate using a lossy randomized compression technique, and subsequently takes a gradient step. This method is a distillation of a key ingredient in the current practice of federated learning, where a model needs to be compressed…

    Submitted 18 March, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

    Comments: NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality. 10 pages, 1 algorithm, 1 theorem, 5 lemmas

  49. arXiv:1909.04715  [pdf, other]

    cs.LG cs.DC math.NA math.OC stat.ML

    First Analysis of Local GD on Heterogeneous Data

    Authors: Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

    Abstract: We provide the first convergence analysis of local gradient descent for minimizing the average of smooth and convex but otherwise arbitrary functions. Problems of this form and local gradient descent as a solution method are of importance in federated learning, where each function is based on private data stored by a user on a mobile device, and the data of different users can be arbitrarily heter…

    Submitted 18 March, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

    Comments: NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality. 11 pages, 4 lemmas, 1 theorem

  50. arXiv:1909.00145  [pdf, other]

    eess.IV cs.LG stat.ML

    Stochastic Convolutional Sparse Coding

    Authors: Jinhui Xiong, Peter Richtárik, Wolfgang Heidrich

    Abstract: State-of-the-art methods for Convolutional Sparse Coding usually employ Fourier-domain solvers in order to speed up the convolution operators. However, this approach is not without shortcomings. For example, Fourier-domain representations implicitly assume circular boundary conditions and make it hard to fully exploit the sparsity of the problem as well as the small spatial support of the filters.…

    Submitted 31 August, 2019; originally announced September 2019.

    Comments: 8 pages