-
Accelerated Distributed Optimization with Compression and Error Feedback
Authors:
Yuan Gao,
Anton Rodomanov,
Jeremy Rack,
Sebastian U. Stich
Abstract:
Modern machine learning tasks often involve massive datasets and models, necessitating distributed optimization algorithms with reduced communication overhead. Communication compression, where clients transmit compressed updates to a central server, has emerged as a key technique to mitigate communication bottlenecks. However, the theoretical understanding of stochastic distributed optimization with contractive compression remains limited, particularly in conjunction with Nesterov acceleration -- a cornerstone for achieving faster convergence in optimization.
In this paper, we propose a novel algorithm, ADEF (Accelerated Distributed Error Feedback), which integrates Nesterov acceleration, contractive compression, error feedback, and gradient difference compression. We prove that ADEF achieves the first accelerated convergence rate for stochastic distributed optimization with contractive compression in the general convex regime. Numerical experiments validate our theoretical findings and demonstrate the practical efficacy of ADEF in reducing communication costs while maintaining fast convergence.
Submitted 29 March, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
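The ADEF abstract above combines contractive compression with error feedback. As a rough illustration of those two ingredients only (this is the classic error-feedback loop with a Top-k compressor, not the ADEF algorithm itself; the names `top_k` and `ErrorFeedbackClient` are hypothetical), a client-side sketch might look like this:

```python
def top_k(vector, k):
    """Top-k compressor: keep the k largest-magnitude entries, zero the rest.

    Top-k is a standard example of a contractive compressor:
    ||top_k(x) - x||^2 <= (1 - k/d) * ||x||^2 for x of dimension d.
    """
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


class ErrorFeedbackClient:
    """Classic error feedback (an assumption for illustration, not ADEF)."""

    def __init__(self, dim, k):
        self.k = k
        self.error = [0.0] * dim  # residual that was not transmitted so far

    def compress(self, gradient):
        # Add back what earlier rounds failed to transmit.
        corrected = [e + g for e, g in zip(self.error, gradient)]
        # Transmit only the compressed version.
        message = top_k(corrected, self.k)
        # Remember the part that was dropped for the next round.
        self.error = [c - m for c, m in zip(corrected, message)]
        return message


client = ErrorFeedbackClient(dim=3, k=1)
first = client.compress([3.0, -1.0, 0.5])
second = client.compress([0.0, -1.0, 0.0])
```

In the second round the accumulated residual makes the second coordinate the largest one, so it is finally transmitted; this is the effect error feedback is meant to provide.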
-
DADA: Dual Averaging with Distance Adaptation
Authors:
Mohammad Moshtaghifar,
Anton Rodomanov,
Daniil Vankov,
Sebastian Stich
Abstract:
We present a novel universal gradient method for solving convex optimization problems. Our algorithm -- Dual Averaging with Distance Adaptation (DADA) -- is based on the classical scheme of dual averaging and dynamically adjusts its coefficients based on observed gradients and the distance between iterates and the starting point, eliminating the need for problem-specific parameters. DADA is a universal algorithm that simultaneously works for a broad spectrum of problem classes, provided the local growth of the objective function around its minimizer can be bounded. Particular examples of such problem classes are nonsmooth Lipschitz functions, Lipschitz-smooth functions, Hölder-smooth functions, functions with high-order Lipschitz derivative, quasi-self-concordant functions, and $(L_0,L_1)$-smooth functions. Crucially, DADA is applicable to both unconstrained and constrained problems, even when the domain is unbounded, without requiring prior knowledge of the number of iterations or desired accuracy.
Submitted 17 January, 2025;
originally announced January 2025.
-
Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods
Authors:
Daniil Vankov,
Anton Rodomanov,
Angelia Nedich,
Lalitha Sankar,
Sebastian U. Stich
Abstract:
We study gradient methods for optimizing $(L_0, L_1)$-smooth functions, a class that generalizes Lipschitz-smooth functions and has gained attention for its relevance in machine learning. We provide new insights into the structure of this function class and develop a principled framework for analyzing optimization methods in this setting. While our convergence rate estimates recover existing results for minimizing the gradient norm in nonconvex problems, our approach significantly improves the best-known complexity bounds for convex objectives. Moreover, we show that the gradient method with Polyak stepsizes and the normalized gradient method achieve nearly the same complexity guarantees as methods that rely on explicit knowledge of $(L_0, L_1)$. Finally, we demonstrate that a carefully designed accelerated gradient method can be applied to $(L_0, L_1)$-smooth functions, further improving all previous results.
Submitted 7 March, 2025; v1 submitted 14 October, 2024;
originally announced October 2024.
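The abstract above mentions the gradient method with Polyak stepsizes, which needs the optimal value but no smoothness constants. A minimal sketch of the textbook Polyak rule, step = (f(x) - f*) / ||∇f(x)||², is given below; the test function (a quartic), the function names, and the assumption that f* is known are illustrative choices, not taken from the paper.

```python
def polyak_gradient_method(f, grad, x0, f_star, num_steps):
    """Gradient method with the classical Polyak stepsize.

    Assumes the optimal value f_star is known; no smoothness
    constants are required.
    """
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # stationary point reached
        step = (f(x) - f_star) / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def quartic(x):
    """Illustrative objective: sum of fourth powers, minimized at 0."""
    return sum(xi ** 4 for xi in x)


def quartic_grad(x):
    return [4.0 * xi ** 3 for xi in x]


# In one dimension each Polyak step maps x to 0.75 * x for this objective.
x_final = polyak_gradient_method(quartic, quartic_grad, [2.0], f_star=0.0, num_steps=10)
```

The quartic is chosen only because its curvature grows with the distance from the minimizer, so a fixed stepsize tuned for one region would be a poor fit elsewhere.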
-
Stabilized Proximal-Point Methods for Federated Optimization
Authors:
Xiaowen Jiang,
Anton Rodomanov,
Sebastian U. Stich
Abstract:
In developing efficient optimization algorithms, it is crucial to account for communication constraints -- a significant challenge in modern Federated Learning. The best-known communication complexity among non-accelerated algorithms is achieved by DANE, a distributed proximal-point algorithm that solves local subproblems at each iteration and that can exploit second-order similarity among individual functions. However, to achieve such communication efficiency, the algorithm requires solving local subproblems sufficiently accurately, resulting in slightly suboptimal local complexity. Inspired by the hybrid-projection proximal-point method, in this work, we propose a novel distributed algorithm S-DANE. Compared to DANE, this method uses an auxiliary sequence of prox-centers while maintaining the same deterministic communication complexity. Moreover, the accuracy condition for solving the subproblem is milder, leading to enhanced local computation efficiency. Furthermore, S-DANE supports partial client participation and arbitrary stochastic local solvers, making it attractive in practice. We further accelerate S-DANE and show that the resulting algorithm achieves the best-known communication complexity among all existing methods for distributed convex optimization while retaining the same good local computation efficiency as S-DANE. Finally, we propose adaptive variants of both methods using line search, obtaining the first provably efficient adaptive algorithms that can exploit local second-order similarity without prior knowledge of any parameters.
Submitted 3 November, 2024; v1 submitted 9 July, 2024;
originally announced July 2024.
-
Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction
Authors:
Anton Rodomanov,
Xiaowen Jiang,
Sebastian Stich
Abstract:
We present adaptive gradient methods (both basic and accelerated) for solving convex composite optimization problems in which the main part is approximately smooth (a.k.a. $(δ, L)$-smooth) and can be accessed only via a (potentially biased) stochastic gradient oracle. This setting covers many interesting examples including Hölder smooth problems and various inexact computations of the stochastic gradient. Our methods use AdaGrad stepsizes and are adaptive in the sense that they do not require knowing any problem-dependent constants except an estimate of the diameter of the feasible set but nevertheless achieve the best possible convergence rates as if they knew the corresponding constants. We demonstrate that AdaGrad stepsizes work in a variety of situations by proving, in a unified manner, three types of new results. First, we establish efficiency guarantees for our methods in the classical setting where the oracle's variance is uniformly bounded. We then show that, under more refined assumptions on the variance, the same methods without any modifications enjoy implicit variance reduction properties allowing us to express their complexity estimates in terms of the variance only at the minimizer. Finally, we show how to incorporate explicit SVRG-type variance reduction into our methods and obtain even faster algorithms. In all three cases, we present both basic and accelerated algorithms achieving state-of-the-art complexity bounds. As a direct corollary of our results, we obtain universal stochastic gradient methods for Hölder smooth problems which can be used in all situations.
Submitted 10 June, 2024;
originally announced June 2024.
-
Global Complexity Analysis of BFGS
Authors:
Anton Rodomanov
Abstract:
In this paper, we present a global complexity analysis of the classical BFGS method with inexact line search, as applied to minimizing a strongly convex function with Lipschitz continuous gradient and Hessian. We consider a variety of standard line search strategies including the backtracking line search based on the Armijo condition, Armijo-Goldstein and Wolfe-Powell line searches. Our analysis suggests that the convergence of the algorithm proceeds in several different stages before the fast superlinear convergence actually begins. Furthermore, if the initial point is far away from the minimizer, superlinear convergence may begin only after a large number of iterations. We show, however, that this drawback can be easily rectified by using a simple restarting procedure.
Submitted 23 April, 2024;
originally announced April 2024.
-
Federated Optimization with Doubly Regularized Drift Correction
Authors:
Xiaowen Jiang,
Anton Rodomanov,
Sebastian U. Stich
Abstract:
Federated learning is a distributed optimization paradigm that allows training machine learning models across decentralized devices while keeping the data localized. The standard method, FedAvg, suffers from client drift which can hamper performance and increase communication costs over centralized methods. Previous works proposed various strategies to mitigate drift, yet none have shown uniformly improved communication-computation trade-offs over vanilla gradient descent.
In this work, we revisit DANE, an established method in distributed optimization. We show that (i) DANE can achieve the desired communication reduction under Hessian similarity constraints. Furthermore, (ii) we present an extension, DANE+, which supports arbitrary inexact local solvers and has more freedom to choose how to aggregate the local updates. We propose (iii) a novel method, FedRed, which has improved local computational complexity and retains the same communication complexity compared to DANE/DANE+. This is achieved by using doubly regularized drift correction.
Submitted 12 April, 2024;
originally announced April 2024.
-
Non-convex Stochastic Composite Optimization with Polyak Momentum
Authors:
Yuan Gao,
Anton Rodomanov,
Sebastian U. Stich
Abstract:
The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, it is well known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e., when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove that this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of the Polyak momentum in the composite optimization setting, and we show that the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.
Submitted 8 December, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
Universal Gradient Methods for Stochastic Convex Optimization
Authors:
Anton Rodomanov,
Ali Kavis,
Yongtao Wu,
Kimon Antonakopoulos,
Volkan Cevher
Abstract:
We develop universal gradient methods for Stochastic Convex Optimization (SCO). Our algorithms automatically adapt not only to the oracle's noise but also to the Hölder smoothness of the objective function without a priori knowledge of the particular setting. The key ingredient is a novel strategy for adjusting step-size coefficients in the Stochastic Gradient Method (SGD). Unlike AdaGrad, which accumulates gradient norms, our Universal Gradient Method accumulates appropriate combinations of gradient- and iterate differences. The resulting algorithm has state-of-the-art worst-case convergence rate guarantees for the entire Hölder class including, in particular, both nonsmooth functions and those with Lipschitz continuous gradient. We also present the Universal Fast Gradient Method for SCO enjoying optimal efficiency estimates.
Submitted 11 July, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Polynomial Preconditioning for Gradient Methods
Authors:
Nikita Doikov,
Anton Rodomanov
Abstract:
We study first-order methods with preconditioning for solving structured nonlinear convex optimization problems. We propose a new family of preconditioners generated by symmetric polynomials. They provide first-order optimization methods with a provable improvement of the condition number, cutting the gaps between highest eigenvalues, without explicit knowledge of the actual spectrum. We give a stochastic interpretation of this preconditioning in terms of coordinate volume sampling and compare it with other classical approaches, including the Chebyshev polynomials. We show how to incorporate a polynomial preconditioning into the Gradient and Fast Gradient Methods and establish the corresponding global complexity bounds. Finally, we propose a simple adaptive search procedure that automatically chooses the best possible polynomial preconditioning for the Gradient Method, minimizing the objective along a low-dimensional Krylov subspace. Numerical experiments confirm the efficiency of our preconditioning strategies for solving various machine learning problems.
Submitted 30 January, 2023;
originally announced January 2023.
-
Gradient Methods for Stochastic Optimization in Relative Scale
Authors:
Yurii Nesterov,
Anton Rodomanov
Abstract:
We propose a new concept of a relatively inexact stochastic subgradient and present novel first-order methods that can use such objects to approximately solve convex optimization problems in relative scale. An important example where relatively inexact subgradients naturally arise is given by the Power or Lanczos algorithms for computing an approximate leading eigenvector of a symmetric positive semidefinite matrix. Using these algorithms as subroutines in our methods, we get new optimization schemes that can provably solve certain large-scale Semidefinite Programming problems with relative accuracy guarantees by using only matrix-vector products.
Submitted 28 May, 2023; v1 submitted 19 January, 2023;
originally announced January 2023.
-
Subgradient Ellipsoid Method for Nonsmooth Convex Problems
Authors:
Anton Rodomanov,
Yurii Nesterov
Abstract:
In this paper, we present a new ellipsoid-type algorithm for solving nonsmooth problems with convex structure. Examples of such problems include nonsmooth convex minimization problems, convex-concave saddle-point problems and variational inequalities with monotone operator. Our algorithm can be seen as a combination of the standard Subgradient and Ellipsoid methods. However, in contrast to the latter, the proposed method has a reasonable convergence rate even when the dimensionality of the problem is sufficiently large. For generating accuracy certificates in our algorithm, we propose an efficient technique, which improves on the previously known recipes.
Submitted 24 June, 2021;
originally announced June 2021.
-
New Results on Superlinear Convergence of Classical Quasi-Newton Methods
Authors:
Anton Rodomanov,
Yurii Nesterov
Abstract:
We present a new theoretical analysis of local superlinear convergence of classical quasi-Newton methods from the convex Broyden class. As a result, we obtain a significant improvement in the currently known estimates of the convergence rates for these methods. In particular, we show that the corresponding rate of the Broyden-Fletcher-Goldfarb-Shanno method depends only on the product of the dimensionality of the problem and the logarithm of its condition number.
Submitted 1 June, 2021; v1 submitted 29 April, 2020;
originally announced April 2020.
-
Rates of superlinear convergence for classical quasi-Newton methods
Authors:
Anton Rodomanov,
Yurii Nesterov
Abstract:
We study the local convergence of classical quasi-Newton methods for nonlinear optimization. Although it was well established a long time ago that asymptotically these methods converge superlinearly, the corresponding rates of convergence still remain unknown. In this paper, we address this problem. We obtain the first explicit non-asymptotic rates of superlinear convergence for the standard quasi-Newton methods, which are based on the updating formulas from the convex Broyden class. In particular, for the well-known DFP and BFGS methods, we obtain rates of the form $(\frac{n L^2}{μ^2 k})^{k/2}$ and $(\frac{n L}{μ k})^{k/2}$ respectively, where $k$ is the iteration counter, $n$ is the dimension of the problem, $μ$ is the strong convexity parameter, and $L$ is the Lipschitz constant of the gradient.
Submitted 1 June, 2021; v1 submitted 20 March, 2020;
originally announced March 2020.
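To get a feel for the two rate expressions quoted in the abstract above, one can simply evaluate them. The sketch below does only that; the function names and the sample values of n, L, μ and k are arbitrary illustrative assumptions. Note that each expression drops below 1 only once k exceeds n L²/μ² (DFP) or n L/μ (BFGS), respectively.

```python
def dfp_rate_bound(n, L, mu, k):
    """Evaluate (n L^2 / (mu^2 k))^(k/2), the DFP rate expression."""
    return (n * L ** 2 / (mu ** 2 * k)) ** (k / 2)


def bfgs_rate_bound(n, L, mu, k):
    """Evaluate (n L / (mu k))^(k/2), the BFGS rate expression."""
    return (n * L / (mu * k)) ** (k / 2)


# Illustrative values (assumptions): dimension 2, condition number L/mu = 4.
n, L, mu = 2, 4.0, 1.0
for k in (16, 32, 64):
    print(k, dfp_rate_bound(n, L, mu, k), bfgs_rate_bound(n, L, mu, k))
```

For these values the BFGS expression is already below 1 at k = 16 (since n L/μ = 8), whereas the DFP expression needs k > 32 (since n L²/μ² = 32).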
-
Greedy Quasi-Newton Methods with Explicit Superlinear Convergence
Authors:
Anton Rodomanov,
Yurii Nesterov
Abstract:
In this paper, we study greedy variants of quasi-Newton methods. They are based on the updating formulas from a certain subclass of the Broyden family. In particular, this subclass includes the well-known DFP, BFGS and SR1 updates. However, in contrast to the classical quasi-Newton methods, which use the difference of successive iterates for updating the Hessian approximations, our methods apply basis vectors, greedily selected so as to maximize a certain measure of progress. For greedy quasi-Newton methods, we establish an explicit non-asymptotic bound on their rate of local superlinear convergence, which contains a contraction factor depending on the square of the iteration counter. We also show that these methods produce Hessian approximations whose deviation from the exact Hessians converges linearly to zero.
Submitted 1 June, 2021; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Smoothness parameter of power of Euclidean norm
Authors:
Anton Rodomanov,
Yurii Nesterov
Abstract:
In this paper, we study derivatives of powers of the Euclidean norm. We prove their Hölder continuity and establish explicit expressions for the corresponding constants. We show that these constants are optimal for odd derivatives and at most two times suboptimal for the even ones. In the particular case of integer powers, when Hölder continuity reduces to Lipschitz continuity, we improve this result and obtain the optimal constants.
Submitted 1 June, 2021; v1 submitted 29 July, 2019;
originally announced July 2019.
-
A Randomized Coordinate Descent Method with Volume Sampling
Authors:
Anton Rodomanov,
Dmitry Kropotov
Abstract:
We analyze the coordinate descent method with a new coordinate selection strategy, called volume sampling. This strategy prescribes selecting subsets of variables of a certain size proportionally to the determinants of principal submatrices of the matrix that bounds the curvature of the objective function. In the particular case when the size of the subsets equals one, volume sampling coincides with the well-known strategy of sampling coordinates proportionally to their Lipschitz constants. For the coordinate descent with volume sampling, we establish the convergence rates both for convex and strongly convex problems. Our theoretical results show that, by increasing the size of the subsets, it is possible to accelerate the method up to the factor which depends on the spectral gap between the corresponding largest eigenvalues of the curvature matrix. Several numerical experiments confirm our theoretical conclusions.
Submitted 29 April, 2020; v1 submitted 9 April, 2019;
originally announced April 2019.
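The volume sampling rule described in the abstract above (subsets chosen with probability proportional to the determinant of the corresponding principal submatrix) can be enumerated directly for tiny problems. The following is a brute-force sketch under the assumption that the curvature matrix is symmetric positive definite; the function names and the 3x3 example matrix are hypothetical, and the enumeration is exponential in the dimension, so it is for illustration only.

```python
from itertools import combinations


def determinant(matrix):
    """Determinant by Laplace expansion along the first row (tiny matrices only)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0.0
    for col, entry in enumerate(matrix[0]):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * entry * determinant(minor)
    return total


def volume_sampling_probabilities(curvature, subset_size):
    """Probability of each index subset, proportional to its principal minor."""
    n = len(curvature)
    weights = {}
    for subset in combinations(range(n), subset_size):
        sub = [[curvature[i][j] for j in subset] for i in subset]
        weights[subset] = determinant(sub)
    total = sum(weights.values())
    return {subset: w / total for subset, w in weights.items()}


# Hypothetical positive definite curvature matrix.
B = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 5.0]]

singles = volume_sampling_probabilities(B, 1)  # proportional to diagonal entries
pairs = volume_sampling_probabilities(B, 2)    # proportional to 2x2 principal minors
```

With subsets of size one the weights are just the diagonal entries of the matrix, which matches the special case mentioned in the abstract.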
-
Primal-Dual Method for Searching Equilibrium in Hierarchical Congestion Population Games
Authors:
Pavel Dvurechensky,
Alexander Gasnikov,
Evgenia Gasnikova,
Sergey Matsievsky,
Anton Rodomanov,
Inna Usik
Abstract:
In this paper, we consider a large class of hierarchical congestion population games. One can show that the equilibrium in a game of this type can be described as a minimum point of a properly constructed multi-level convex optimization problem. We propose a fast primal-dual composite gradient method and apply it to the problem that is dual to the problem describing the equilibrium in the considered class of games. We prove that this method allows one to find an approximate solution of the initial problem without increasing the complexity.
Submitted 25 August, 2016; v1 submitted 29 June, 2016;
originally announced June 2016.