-
Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order
Authors:
Egor Petrov,
Grigoriy Evseev,
Aleksey Antonov,
Andrey Veprikov,
Pavel Plyusnin,
Nikolay Bushkov,
Stanislav Moiseev,
Aleksandr Beznosikov
Abstract:
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM
Submitted 11 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
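For orientation, a zero-order sign step can be sketched as follows. This is a generic two-point illustration and not the JAGUAR SignSGD or JAGUAR Muon algorithm from the paper: the function name `two_point_sign_step`, the Rademacher perturbation, and the parameters `lr` and `tau` are all assumptions made for this example.

```python
import random


def two_point_sign_step(f, x, lr=0.01, tau=1e-3, rng=random):
    """One illustrative zero-order sign step (hypothetical helper).

    Uses exactly two evaluations of f, independent of the dimension of x.
    """
    # Random +/-1 direction; an assumed choice of perturbation.
    e = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + tau * ei for xi, ei in zip(x, e)]
    x_minus = [xi - tau * ei for xi, ei in zip(x, e)]
    # Central finite difference: estimate of the directional derivative along e.
    slope = (f(x_plus) - f(x_minus)) / (2.0 * tau)

    def sign(v):
        return (v > 0) - (v < 0)

    # The gradient estimate is slope * e; only its sign enters the update,
    # so each coordinate moves by lr (or stays put if the estimate is zero).
    return [xi - lr * sign(slope * ei) for xi, ei in zip(x, e)]


if __name__ == "__main__":
    quadratic = lambda v: sum(t * t for t in v)
    point = [1.0, -2.0]
    for _ in range(100):
        point = two_point_sign_step(quadratic, point)
    print(quadratic(point))
```

No gradient or optimizer state is stored here; the momentum buffer that the abstract describes is deliberately left out of this sketch.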
-
Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization
Authors:
Daniil Medyakov,
Sergey Stanko,
Gleb Molodtsov,
Philip Zmushko,
Grigoriy Evseev,
Egor Petrov,
Aleksandr Beznosikov
Abstract:
Large language models have recently made significant breakthroughs across various disciplines. However, training them is extremely resource-intensive, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in distributed learning. Nevertheless, the effective stepsize cannot be determined automatically from theory alone, because it depends on parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, and methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
Submitted 4 June, 2025;
originally announced June 2025.
-
WeightLoRA: Keep Only Necessary Adapters
Authors:
Andrey Veprikov,
Vladimir Solodkin,
Alexander Zyl,
Andrey Savchenko,
Aleksandr Beznosikov
Abstract:
The widespread utilization of language models in modern applications is inconceivable without Parameter-Efficient Fine-Tuning techniques, such as low-rank adaptation ($\texttt{LoRA}$), which adds trainable adapters to selected layers. Although $\texttt{LoRA}$ may obtain accurate solutions, it requires significant memory to train large models and intuition on which layers to add adapters. In this paper, we propose a novel method, $\texttt{WeightLoRA}$, which overcomes this issue by adaptive selection of the most critical $\texttt{LoRA}$ heads throughout the optimization process. As a result, we can significantly reduce the number of trainable parameters while maintaining the capability to obtain consistent or even superior metric values. We conduct experiments for a series of competitive benchmarks and DeBERTa, BART, and Llama models, comparing our method with different adaptive approaches. The experimental results demonstrate the efficacy of $\texttt{WeightLoRA}$ and the superior performance of $\texttt{WeightLoRA+}$ in almost all cases.
Submitted 3 June, 2025;
originally announced June 2025.
-
Incorporating Preconditioning into Accelerated Approaches: Theoretical Guarantees and Practical Improvement
Authors:
Stepan Trifonov,
Leonid Levin,
Savelii Chezhegov,
Aleksandr Beznosikov
Abstract:
Machine learning and deep learning are widely researched fields that provide solutions to many modern problems. Due to the complexity of new problems related to the size of datasets, efficient approaches are obligatory. In optimization theory, the Heavy Ball and Nesterov methods use \textit{momentum} in their updates of model weights. On the other hand, the minimization problems considered may be poorly conditioned, which affects the applicability and effectiveness of the aforementioned techniques. One solution to this issue is \textit{preconditioning}, which has already been investigated in approaches such as \textsc{AdaGrad}, \textsc{RMSProp}, \textsc{Adam} and others. Despite this, momentum acceleration and preconditioning have not been fully explored together. Therefore, we propose the Preconditioned Heavy Ball (\textsc{PHB}) and Preconditioned Nesterov method (\textsc{PN}) with theoretical guarantees of convergence under \textit{unified} assumption on the scaling matrix. Furthermore, we provide numerical experiments that demonstrate superior performance compared to the unscaled techniques in terms of iteration and oracle complexities.
Submitted 29 May, 2025;
originally announced May 2025.
-
Convergence of Clipped-SGD for Convex $(L_0,L_1)$-Smooth Optimization with Heavy-Tailed Noise
Authors:
Savelii Chezhegov,
Aleksandr Beznosikov,
Samuel Horváth,
Eduard Gorbunov
Abstract:
Gradient clipping is a widely used technique in Machine Learning and Deep Learning (DL), known for its effectiveness in mitigating the impact of heavy-tailed noise, which frequently arises in the training of large language models. Additionally, first-order methods with clipping, such as Clip-SGD, exhibit stronger convergence guarantees than SGD under the $(L_0,L_1)$-smoothness assumption, a property observed in many DL tasks. However, the high-probability convergence of Clip-SGD under both assumptions -- heavy-tailed noise and $(L_0,L_1)$-smoothness -- has not been fully addressed in the literature. In this paper, we bridge this critical gap by establishing the first high-probability convergence bounds for Clip-SGD applied to convex $(L_0,L_1)$-smooth optimization with heavy-tailed noise. Our analysis extends prior results by recovering known bounds for the deterministic case and the stochastic setting with $L_1 = 0$ as special cases. Notably, our rates avoid exponentially large factors and do not rely on restrictive sub-Gaussian noise assumptions, significantly broadening the applicability of gradient clipping.
Submitted 27 May, 2025;
originally announced May 2025.
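As a reminder of what the clipping operator does, here is a minimal sketch of norm clipping inside a plain SGD step. It shows the standard operator $\mathrm{clip}(g, \lambda) = \min\{1, \lambda / \|g\|\}\, g$ only; the function names and the parameters `lr` and `lam` are assumptions for this example, and the stepsize and clipping-level schedules analyzed in the paper are not reproduced.

```python
import math


def clip(g, lam):
    """Rescale g so its Euclidean norm is at most lam (direction unchanged)."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= lam:
        # Small gradients pass through untouched.
        return list(g)
    scale = lam / norm
    return [scale * v for v in g]


def clipped_sgd_step(x, g, lr, lam):
    """One SGD step using the clipped stochastic gradient g (hypothetical helper)."""
    return [xi - lr * gi for xi, gi in zip(x, clip(g, lam))]


if __name__ == "__main__":
    # A gradient of norm 5 is rescaled to norm 1 before the step.
    print(clipped_sgd_step([0.0, 0.0], [3.0, 4.0], lr=0.1, lam=1.0))
```

Because the clipped vector never exceeds norm `lam`, a single heavy-tailed gradient sample can move the iterate by at most `lr * lam`.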
-
Trial and Trust: Addressing Byzantine Attacks with Comprehensive Defense Strategy
Authors:
Gleb Molodtsov,
Daniil Medyakov,
Sergey Skorik,
Nikolas Khachaturov,
Shahane Tigranyan,
Vladimir Aletov,
Aram Avetisyan,
Martin Takáč,
Aleksandr Beznosikov
Abstract:
Recent advancements in machine learning have improved performance while also increasing computational demands. While federated and distributed setups address these issues, their structure is vulnerable to malicious influences. In this paper, we address a specific threat, Byzantine attacks, where compromised clients inject adversarial updates to derail global convergence. We combine the trust scores concept with trial function methodology to dynamically filter outliers. Our methods address the critical limitations of previous approaches, allowing functionality even when Byzantine nodes are in the majority. Moreover, our algorithms adapt to widely used scaled methods like Adam and RMSProp, as well as practical scenarios, including local training and partial participation. We validate the robustness of our methods by conducting extensive experiments on both synthetic and real ECG data collected from medical institutions. Furthermore, we provide a broad theoretical analysis of our algorithms and their extensions to aforementioned practical setups. The convergence guarantees of our methods are comparable to those of classical algorithms developed without Byzantine interference.
Submitted 9 June, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency through Shuffling
Authors:
Daniil Medyakov,
Gleb Molodtsov,
Savelii Chezhegov,
Alexey Rebrikov,
Aleksandr Beznosikov
Abstract:
In today's world, machine learning is hard to imagine without large training datasets and models. This has led to the use of stochastic methods for training, such as stochastic gradient descent (SGD). SGD provides weak theoretical guarantees of convergence, but there are modifications, such as Stochastic Variance Reduced Gradient (SVRG) and StochAstic Recursive grAdient algoritHm (SARAH), that can reduce the variance. These methods require the computation of the full gradient occasionally, which can be time consuming. In this paper, we explore variants of variance reduction algorithms that eliminate the need for full gradient computations. To make our approach memory-efficient and avoid full gradient computations, we use two key techniques: the shuffling heuristic and idea of SAG/SAGA methods. As a result, we improve existing estimates for variance reduction algorithms without the full gradient computations. Additionally, for the non-convex objective function, our estimate matches that of classic shuffling methods, while for the strongly convex one, it is an improvement. We conduct comprehensive theoretical analysis and provide extensive experimental results to validate the efficiency and practicality of our methods for large-scale machine learning problems.
Submitted 20 February, 2025;
originally announced February 2025.
-
Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness
Authors:
Nikita Kornilov,
Philip Zmushko,
Andrei Semenov,
Mark Ikonnikov,
Alexander Gasnikov,
Alexander Beznosikov
Abstract:
In recent years, non-convex optimization problems have more often been described by the generalized $(L_0, L_1)$-smoothness assumption rather than the standard one. Meanwhile, severely corrupted data used in these problems has increased the demand for methods capable of handling heavy-tailed noise, i.e., noise with bounded $\kappa$-th moment. Motivated by these real-world trends and challenges, we explore sign-based methods in this setup and demonstrate their effectiveness in comparison with other popular solutions like clipping or normalization.
In theory, we prove the first-known high probability convergence bounds under $(L_0, L_1)$-smoothness and heavy-tailed noise with mild parameter dependencies. In the case of standard smoothness, these bounds are novel for sign-based methods as well. In particular, SignSGD with batching achieves sample complexity $\tilde{O}\left(\left(\frac{\Delta L_0 d}{\varepsilon^2} + \frac{\Delta L_1 d^{\frac{3}{2}}}{\varepsilon}\right)\left[1 + \left(\frac{\sigma}{\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}\right]\right)$ for $\kappa \in (1,2]$. Under the assumption of symmetric noise, SignSGD with Majority Voting can robustly work on the whole range of $\kappa \in (0,2]$ with complexity $\tilde{O}\left(\left(\frac{\Delta L_0 d}{\varepsilon^2} + \frac{\Delta L_1 d^{\frac{3}{2}}}{\varepsilon}\right)\left[\frac{1}{\kappa^2} + \frac{\sigma^2}{\varepsilon^2}\right]\right)$. We also obtain results for parameter-agnostic setups, Polyak-Lojasiewicz functions and momentum-based methods (in expectation). Our theoretical findings are supported by the superior performance of sign-based methods in training Large Language Models compared to clipping and normalization.
Submitted 27 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
Accelerated Methods with Compressed Communications for Distributed Optimization Problems under Data Similarity
Authors:
Dmitry Bylinkin,
Aleksandr Beznosikov
Abstract:
In recent years, as data and problem sizes have increased, distributed learning has become an essential tool for training high-performance models. However, the communication bottleneck, especially for high-dimensional data, is a challenge. Several techniques have been developed to overcome this problem. These include communication compression and implementation of local steps, which work particularly well when there is similarity of local data samples. In this paper, we study the synergy of these approaches for efficient distributed optimization. We propose the first theoretically grounded accelerated algorithms utilizing unbiased and biased compression under data similarity, leveraging variance reduction and error feedback frameworks. Our results are of record and confirmed by experiments on different average losses and datasets.
Submitted 20 December, 2024;
originally announced December 2024.
-
Effective Method with Compression for Distributed and Federated Cocoercive Variational Inequalities
Authors:
Daniil Medyakov,
Gleb Molodtsov,
Aleksandr Beznosikov
Abstract:
Variational inequalities as an effective tool for solving applied problems, including machine learning tasks, have been attracting more and more attention from researchers in recent years. The use of variational inequalities covers a wide range of areas - from reinforcement learning and generative models to traditional applications in economics and game theory. At the same time, it is impossible to imagine the modern world of machine learning without distributed optimization approaches that can significantly speed up the training process on large amounts of data. However, faced with the high costs of communication between devices in a computing network, the scientific community is striving to develop approaches that make computations cheap and stable. In this paper, we investigate the compression technique of transmitted information and its application to the distributed variational inequalities problem. In particular, we present a method based on advanced techniques originally developed for minimization problems. For the new method, we provide an exhaustive theoretical convergence analysis for cocoercive strongly monotone variational inequalities. We conduct experiments that emphasize the high performance of the presented technique and confirm its practical applicability.
Submitted 19 December, 2024;
originally announced December 2024.
-
Accelerated Stochastic ExtraGradient: Mixing Hessian and Gradient Similarity to Reduce Communication in Distributed and Federated Learning
Authors:
Dmitry Bylinkin,
Kirill Degtyarev,
Aleksandr Beznosikov
Abstract:
Modern realities and trends in learning require more and more generalization ability of models, which leads to an increase in both models and training sample size. It is already difficult to solve such tasks in a single device mode. This is the reason why distributed and federated learning approaches are becoming more popular every day. Distributed computing involves communication between devices, which requires solving two key problems: efficiency and privacy. One of the most well-known approaches to combat communication costs is to exploit the similarity of local data. Both Hessian similarity and homogeneous gradients have been studied in the literature, but separately. In this paper, we combine both of these assumptions in analyzing a new method that incorporates the ideas of using data similarity and clients sampling. Moreover, to address privacy concerns, we apply the technique of additional noise and analyze its impact on the convergence of the proposed method. The theory is confirmed by training on real datasets.
Submitted 21 September, 2024;
originally announced September 2024.
-
Methods for Solving Variational Inequalities with Markovian Stochasticity
Authors:
Vladimir Solodkin,
Michael Ermoshin,
Roman Gavrilenko,
Aleksandr Beznosikov
Abstract:
In this paper, we present a novel stochastic method for solving variational inequalities (VI) in the context of Markovian noise. By leveraging the Extragradient technique, we can efficiently solve VI problems characterized by Markovian dynamics. We demonstrate the efficacy of the proposed method through rigorous theoretical analysis, proving convergence under quite mild assumptions of $L$-Lipschitzness, strong monotonicity of the operator, and boundedness of the noise only at the optimum. To gain further insight into the nature of Markov processes, we conduct experiments to investigate the impact of the mixing time parameter on the convergence of the algorithm.
Submitted 20 September, 2024;
originally announced September 2024.
-
Local SGD for Near-Quadratic Problems: Improving Convergence under Unconstrained Noise Conditions
Authors:
Andrey Sadchikov,
Savelii Chezhegov,
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
Distributed optimization plays an important role in modern large-scale machine learning and data processing systems by optimizing the utilization of computational resources. One of the classical and popular approaches is Local Stochastic Gradient Descent (Local SGD), characterized by multiple local updates before averaging, which is particularly useful in distributed environments to reduce communication bottlenecks and improve scalability. A typical feature of this method is the dependence on the frequency of communications. But in the case of a quadratic target function with homogeneous data distribution over all devices, the influence of frequency of communications vanishes. As a natural consequence, subsequent studies include the assumption of a Lipschitz Hessian, as this indicates the similarity of the optimized function to a quadratic one to some extent. However, in order to extend the completeness of the Local SGD theory and unlock its potential, in this paper we abandon the Lipschitz Hessian assumption by introducing a new concept of $\textit{approximate quadraticity}$. This assumption gives a new perspective on problems that have near quadratic properties. In addition, existing theoretical analyses of Local SGD often assume bounded variance. We, in turn, consider the unbounded noise condition, which allows us to broaden the class of studied problems.
Submitted 18 December, 2024; v1 submitted 16 September, 2024;
originally announced September 2024.
-
New Aspects of Black Box Conditional Gradient: Variance Reduction and One Point Feedback
Authors:
Andrey Veprikov,
Aleksandr Bogdanov,
Vladislav Minashkin,
Aleksandr Beznosikov
Abstract:
This paper deals with the black-box optimization problem. In this setup, we do not have access to the gradient of the objective function, therefore, we need to estimate it somehow. We propose a new type of approximation JAGUAR, that memorizes information from previous iterations and requires $\mathcal{O}(1)$ oracle calls. We implement this approximation in the Frank-Wolfe and Gradient Descent algorithms and prove the convergence of these methods with different types of zero-order oracle. Our theoretical analysis covers scenarios of non-convex, convex and PL-condition cases. Also in this paper, we consider the stochastic minimization problem on the set $Q$ with noise in the zero-order oracle; this setup is quite unpopular in the literature, but we prove that the JAGUAR approximation is robust not only in deterministic minimization problems, but also in the stochastic case. We perform experiments to compare our gradient estimator with those already known in the literature and confirm the dominance of our methods.
Submitted 17 September, 2024; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Method with Batching for Stochastic Finite-Sum Variational Inequalities in Non-Euclidean Setting
Authors:
Alexander Pichugin,
Maksim Pechin,
Aleksandr Beznosikov,
Vasilii Novitskii,
Alexander Gasnikov
Abstract:
Variational inequalities are a universal optimization paradigm that incorporates classical minimization and saddle point problems. Nowadays, more and more tasks require considering stochastic formulations of optimization problems. In this paper, we present an analysis of a method that gives optimal convergence estimates for monotone stochastic finite-sum variational inequalities. In contrast to previous works, our method supports batching, does not lose the oracle complexity optimality, and uses an arbitrary Bregman distance to take into account the geometry of the problem. The paper provides experimental confirmation of the algorithm's effectiveness.
Submitted 15 September, 2024; v1 submitted 13 August, 2024;
originally announced August 2024.
-
Methods for Optimization Problems with Markovian Stochasticity and Non-Euclidean Geometry
Authors:
Vladimir Solodkin,
Andrew Veprikov,
Aleksandr Beznosikov
Abstract:
This paper examines a variety of classical optimization problems, including well-known minimization tasks and more general variational inequalities. We consider a stochastic formulation of these problems, and unlike most previous work, we take into account the complex Markov nature of the noise. We also consider the geometry of the problem in an arbitrary non-Euclidean setting, and propose four methods based on the Mirror Descent iteration technique. Theoretical analysis is provided for smooth and convex minimization problems and variational inequalities with Lipschitz and monotone operators. The convergence guarantees obtained are optimal for first-order stochastic methods, as evidenced by the lower bound estimates provided in this paper.
Submitted 3 August, 2024;
originally announced August 2024.
-
Stochastic Frank-Wolfe: Unified Analysis and Zoo of Special Cases
Authors:
Ruslan Nazykov,
Aleksandr Shestakov,
Vladimir Solodkin,
Aleksandr Beznosikov,
Gauthier Gidel,
Alexander Gasnikov
Abstract:
The Conditional Gradient (or Frank-Wolfe) method is one of the most well-known methods for solving constrained optimization problems appearing in various machine learning tasks. The simplicity of iteration and applicability to many practical problems helped the method to gain popularity in the community. In recent years, the Frank-Wolfe algorithm received many different extensions, including stochastic modifications with variance reduction and coordinate sampling for training of huge models or distributed variants for big data problems. In this paper, we present a unified convergence analysis of the Stochastic Frank-Wolfe method that covers a large number of particular practical cases that may have completely different nature of stochasticity, intuitions and application areas. Our analysis is based on a key parametric assumption on the variance of the stochastic gradients. But unlike most works on unified analysis of other methods, such as SGD, we do not assume an unbiasedness of the real gradient estimation. We conduct analysis for convex and non-convex problems due to the popularity of both cases in machine learning. With this general theoretical framework, we not only cover rates of many known methods, but also develop numerous new methods. This shows the flexibility of our approach in developing new algorithms based on the Conditional Gradient approach. We also demonstrate the properties of the new methods through numerical experiments.
Submitted 15 September, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Accelerated Stochastic Gradient Method with Applications to Consensus Problem in Markov-Varying Networks
Authors:
Vladimir Solodkin,
Savelii Chezhegov,
Ruslan Nazikov,
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
Stochastic optimization is a vital field in the realm of mathematical optimization, finding applications in diverse areas ranging from operations research to machine learning. In this paper, we introduce a novel first-order optimization algorithm designed for scenarios where Markovian noise is present, incorporating Nesterov acceleration for enhanced efficiency. The convergence analysis is performed using an assumption on noise depending on the distance to the solution. We also delve into the consensus problem over Markov-varying networks, exploring how this algorithm can be applied to achieve agreement among multiple agents with differing objectives during changes in the communication system. To show the performance of our method on the problem above, we conduct experiments to demonstrate the superiority over the classic approach.
Submitted 8 June, 2024;
originally announced June 2024.
-
Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed
Authors:
Savelii Chezhegov,
Yaroslav Klyukin,
Andrei Semenov,
Aleksandr Beznosikov,
Alexander Gasnikov,
Samuel Horváth,
Martin Takáč,
Eduard Gorbunov
Abstract:
Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence for such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed versions) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam-Norm in handling heavy-tailed noise.
Submitted 13 March, 2025; v1 submitted 6 June, 2024;
originally announced June 2024.
-
Local Methods with Adaptivity via Scaling
Authors:
Savelii Chezhegov,
Sergey Skorik,
Nikolas Khachaturov,
Danil Shalagin,
Aram Avetisyan,
Martin Takáč,
Yaroslav Kholodov,
Aleksandr Beznosikov
Abstract:
The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges that must be addressed. Indeed, training modern, advanced models has become difficult to implement without leveraging multiple computing nodes in a distributed environment. Distributed optimization is also fundamental to emerging fields such as federated learning. Specifically, there is a need to organize the training process to minimize the time lost due to communication. A widely used and extensively researched technique to mitigate the communication bottleneck involves performing local training before communication. This approach is the focus of our paper. Concurrently, adaptive methods that incorporate scaling, notably led by Adam, have gained significant popularity in recent years. Therefore, this paper aims to merge the local training technique with the adaptive approach to develop efficient distributed learning methods. We consider the classical Local SGD method and enhance it with a scaling feature. A crucial aspect is that the scaling is described generically, allowing us to analyze various approaches, including Adam, RMSProp, and OASIS, in a unified manner. In addition to theoretical analysis, we validate the performance of our methods in practice by training a neural network.
Submitted 16 September, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
Accelerated Methods with Compression for Horizontal and Vertical Federated Learning
Authors:
Sergey Stanko,
Timur Karimullin,
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
Distributed optimization algorithms have emerged as a superior approach for solving machine learning problems. To accommodate the diverse ways in which data can be stored across devices, these methods must be adaptable to a wide range of situations. As a result, two orthogonal regimes of distributed algorithms are distinguished: horizontal and vertical. During parallel training, communication between nodes can become a critical bottleneck, particularly for high-dimensional and over-parameterized models. Therefore, it is crucial to enhance current methods with strategies that minimize the amount of data transmitted during training while still achieving a model of similar quality. This paper introduces two accelerated algorithms with various compressors, working in the regimes of horizontal and vertical data division. By utilizing the momentum and variance reduction technique from the Katyusha algorithm, we achieve acceleration and demonstrate one of the best asymptotics for the horizontal case. Additionally, we provide one of the first theoretical convergence guarantees for the vertical regime. Our experiments involve several compression operators, including RandK and PermK, and demonstrate superior practical performance compared to other popular approaches.
Submitted 20 April, 2024;
originally announced April 2024.
-
Extragradient Sliding for Composite Non-Monotone Variational Inequalities
Authors:
Roman Emelyanov,
Andrey Tikhomirov,
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
Variational inequalities offer a versatile and straightforward approach to analyzing a broad range of equilibrium problems in both theoretical and practical fields. In this paper, we consider a composite generally non-monotone variational inequality represented as a sum of $L_q$-Lipschitz monotone and $L_p$-Lipschitz generally non-monotone operators. We apply a special sliding version of the classical Extragradient method to this problem and obtain better convergence results. In particular, to achieve $\varepsilon$-accuracy of the solution, the oracle complexity of the non-monotone operator $Q$ for our algorithm is $O\left(L_p^2/\varepsilon^2\right)$, in contrast to $O\left((L_p+L_q)^2/\varepsilon^2\right)$ for the basic Extragradient algorithm. The results of numerical experiments confirm the theoretical findings and show the superiority of the proposed method.
Submitted 22 March, 2024;
originally announced March 2024.
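For orientation, the classical Extragradient step that the abstract above takes as its baseline can be sketched as follows. This is a minimal illustration only, not the sliding variant from the paper: the names `extragradient` and `bilinear_operator`, the toy problem $\min_x \max_y xy$, and the step size are all assumptions chosen for the example.

```python
import numpy as np

def extragradient(F, z0, step, n_iters):
    """Classical Extragradient: extrapolate with F(z), then update with F(z_half)."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iters):
        z_half = z - step * F(z)     # extrapolation (look-ahead) step
        z = z - step * F(z_half)     # update using the operator at the look-ahead point
    return z

def bilinear_operator(z):
    """Operator of the toy saddle problem min_x max_y x*y: F(x, y) = (y, -x)."""
    x, y = z
    return np.array([y, -x])

# Hypothetical toy run: the unique solution of the saddle problem is (0, 0).
z_final = extragradient(bilinear_operator, [1.0, 1.0], step=0.5, n_iters=200)
print(np.linalg.norm(z_final))
```

On this monotone toy problem the look-ahead step is what makes the iteration contract toward the solution; a plain simultaneous gradient step with the same operator would move away from it.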
-
Decentralized Finite-Sum Optimization over Time-Varying Networks
Authors:
Dmitry Metelev,
Savelii Chezhegov,
Alexander Rogozin,
Aleksandr Beznosikov,
Alexander Sholokhov,
Alexander Gasnikov,
Dmitry Kovalev
Abstract:
We consider decentralized time-varying stochastic optimization problems where each of the functions held by the nodes has a finite-sum structure. Such problems can be efficiently solved using variance reduction techniques. Our aim is to explore the lower complexity bounds (for communication and number of stochastic oracle calls) and find optimal algorithms. The paper studies strongly convex and nonconvex scenarios. To the best of our knowledge, variance-reduced schemes and lower bounds for time-varying graphs have not been studied in the literature. For nonconvex objectives, we obtain lower bounds and develop an optimal method, GT-PAGE. For strongly convex objectives, we propose the first decentralized time-varying variance-reduction method, ADOM+VR, and establish a lower bound in this scenario, highlighting the open question of matching the algorithm complexity and lower bounds even in the static network case.
Submitted 7 February, 2025; v1 submitted 4 February, 2024;
originally announced February 2024.
-
Optimal Analysis of Method with Batching for Monotone Stochastic Finite-Sum Variational Inequalities
Authors:
Alexander Pichugin,
Maksim Pechin,
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
Variational inequalities are a universal optimization paradigm that is interesting in itself, but also incorporates classical minimization and saddle point problems. Modern realities encourage the consideration of stochastic formulations of optimization problems. In this paper, we present an analysis of a method that gives optimal convergence estimates for monotone stochastic finite-sum variational inequalities. In contrast to previous works, our method supports batching and does not lose the oracle complexity optimality. The effectiveness of the algorithm, especially in the case of small but not single batches, is confirmed experimentally.
Submitted 26 March, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
Optimal Data Splitting in Distributed Optimization for Machine Learning
Authors:
Daniil Medyakov,
Gleb Molodtsov,
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
The distributed optimization problem has become increasingly relevant recently. It has many advantages, such as processing a large amount of data in less time compared to non-distributed methods. However, most distributed approaches suffer from a significant bottleneck: the cost of communications. Therefore, a large amount of research has recently been directed at solving this problem. One such approach uses local data similarity. In particular, there exists an algorithm that provably optimally exploits the similarity property. But this result, as well as results from other works, addresses the communication bottleneck by focusing only on the fact that communication is significantly more expensive than local computing, and does not take into account the various capacities of network devices and the different relationships between communication time and local computing expenses. We consider this setup, and the objective of this study is to achieve an optimal ratio of distributed data between the server and local machines for any costs of communications and local computations. The running times of the network are compared between uniform and optimal distributions. The superior theoretical performance of our solutions is experimentally validated.
Submitted 26 March, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
Activations and Gradients Compression for Model-Parallel Training
Authors:
Mikhail Rudakov,
Aleksandr Beznosikov,
Yaroslav Kholodov,
Alexander Gasnikov
Abstract:
Large neural networks require enormous computational clusters of machines. Model-parallel training, in which the model architecture is partitioned sequentially between workers, is a popular approach for training modern models. Information compression can be applied to decrease workers' communication time, as it is often a bottleneck in such systems. This work explores how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we employ TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that $K=10\%$ is the lowest TopK compression level that does not harm model convergence severely. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but allow model inference without compression with almost no quality drop. Finally, when applied with the AQ-SGD approach, TopK compression stronger than $K=30\%$ worsens model performance significantly.
Submitted 26 March, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
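As a rough illustration of the TopK compression and error compensation mentioned in the abstract above, the following sketch keeps a fraction $K$ of the largest-magnitude entries and carries the discarded remainder into the next call. It is a generic textbook-style error feedback loop, not the AQ-SGD per-batch scheme used in the paper; the names `top_k` and `ErrorFeedback` are hypothetical.

```python
import numpy as np

def top_k(x: np.ndarray, fraction: float) -> np.ndarray:
    """Keep the `fraction` largest-magnitude entries of x and zero out the rest."""
    flat = x.ravel()
    k = max(1, int(round(fraction * flat.size)))
    # argpartition places the indices of the k largest magnitudes in the last k slots.
    keep = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(x.shape)

class ErrorFeedback:
    """Generic error feedback: add back what earlier compressions dropped."""

    def __init__(self, shape):
        self.residual = np.zeros(shape)

    def compress(self, x: np.ndarray, fraction: float) -> np.ndarray:
        corrected = x + self.residual        # re-inject the accumulated error
        sent = top_k(corrected, fraction)    # what would actually be transmitted
        self.residual = corrected - sent     # remember what was dropped this time
        return sent

# Hypothetical usage with K = 30% on a 10-element vector (3 entries survive).
vec = np.array([0.1, -5.0, 2.0, 0.3, -0.2, 4.0, 0.0, 1.0, -0.4, 0.05])
ef = ErrorFeedback(vec.shape)
print(ef.compress(vec, 0.3))
```

In this sketch `fraction` plays the role of $K$: a smaller value means stronger compression and a larger residual carried forward.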
-
About some works of Boris Polyak on convergence of gradient methods and their development
Authors:
Seydamet Ablaev,
Aleksandr Beznosikov,
Alexander Gasnikov,
Darina Dvinskikh,
Aleksandr Lobanov,
Sergei Puchinin,
Fedor Stonyakin
Abstract:
The paper presents a review of the state of the art in subgradient and accelerated methods of convex optimization, including settings with disturbances and with access to various kinds of information about the objective function (function value, gradient, stochastic gradient, higher derivatives). For nonconvex problems, the Polyak-Łojasiewicz condition is considered and a review of the main results is given. The behavior of numerical methods in the presence of sharp minima is considered. The purpose of this survey is to show the influence of the works of B.T. Polyak (1935-2023) on gradient optimization methods and related areas on the modern development of numerical optimization methods.
Submitted 24 December, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
Bregman Proximal Method for Efficient Communications under Similarity
Authors:
Aleksandr Beznosikov,
Darina Dvinskikh,
Dmitry Bylinkin,
Andrei Semenov,
Alexander Gasnikov
Abstract:
We propose a novel stochastic distributed method for both monotone and strongly monotone variational inequalities with Lipschitz operator and proper convex regularizers arising in various applications from game theory to adversarial training. By exploiting similarity, our algorithm overcomes the communication bottleneck that is a major issue in distributed optimization. The proposed method enjoys optimal communication complexity. All the existing distributed algorithms achieving the lower bounds under similarity condition essentially utilize the Euclidean setup. In contrast to them, our method is built upon the Bregman proximal maps and it is compatible with an arbitrary problem geometry. Thereby the proposed method fills an existing gap in this area of research. Our theoretical results are confirmed by numerical experiments on a stochastic matrix game.
Submitted 4 October, 2024; v1 submitted 12 November, 2023;
originally announced November 2023.
-
Ito Diffusion Approximation of Universal Ito Chains for Sampling, Optimization and Boosting
Authors:
Aleksei Ustimenko,
Aleksandr Beznosikov
Abstract:
In this work, we consider a rather general and broad class of Markov chains, Ito chains, that look like an Euler-Maruyama discretization of some Stochastic Differential Equation. The chain we study is a unified framework for theoretical analysis. It comes with almost arbitrary isotropic and state-dependent noise instead of the normal and state-independent noise used in most related papers. Moreover, in our chain the drift and diffusion coefficient can be inexact in order to cover a wide range of applications such as Stochastic Gradient Langevin Dynamics, sampling, Stochastic Gradient Descent or Stochastic Gradient Boosting. We prove a bound in $W_{2}$-distance between the laws of our Ito chain and the corresponding differential equation. These results improve or cover most of the known estimates, and for some particular cases our analysis is the first.
Submitted 30 March, 2024; v1 submitted 9 October, 2023;
originally announced October 2023.
-
Real Acceleration of Communication Process in Distributed Algorithms with Compression
Authors:
Svetlana Tkachenko,
Artem Andreev,
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
Modern applied optimization problems become more and more complex every day. Because of this, distributed algorithms that can speed up the process of solving an optimization problem through parallelization are of great importance. The main bottleneck of distributed algorithms is communication, which can slow down the method dramatically. One way to address this issue is to compress the transmitted information. In the current literature on theoretical distributed optimization, it is generally accepted that communication time decreases in proportion to how much the information is compressed. But in reality, the communication time depends not only on the size of the transmitted information, but also, for example, on the message startup time. In this paper, we study distributed optimization algorithms under the assumption of a more complex and closer-to-reality dependence of transmission time on compression. In particular, we describe the real speedup achieved by compression, analyze how much it makes sense to compress information, and present an adaptive way to select the power of compression depending on unknown or changing parameters of the communication process.
Submitted 10 September, 2023;
originally announced September 2023.
-
Optimal Algorithm with Complexity Separation for Strongly Convex-Strongly Concave Composite Saddle Point Problems
Authors:
Ekaterina Borodich,
Georgiy Kormakov,
Dmitry Kovalev,
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
In this work, we focus on the following saddle point problem $\min_x \max_y p(x) + R(x,y) - q(y)$ where $R(x,y)$ is $L_R$-smooth, $\mu_x$-strongly convex, $\mu_y$-strongly concave and $p(x), q(y)$ are convex and $L_p, L_q$-smooth respectively. We present a new algorithm with optimal overall complexity $\mathcal{O}\left(\left(\sqrt{\frac{L_p}{\mu_x}} + \frac{L_R}{\sqrt{\mu_x \mu_y}} + \sqrt{\frac{L_q}{\mu_y}}\right)\log \frac{1}{\varepsilon}\right)$ and separation of oracle calls in the composite and saddle part. This algorithm requires $\mathcal{O}\left(\left(\sqrt{\frac{L_p}{\mu_x}} + \sqrt{\frac{L_q}{\mu_y}}\right) \log \frac{1}{\varepsilon}\right)$ oracle calls for $\nabla p(x)$ and $\nabla q(y)$ and $\mathcal{O} \left( \max\left\{\sqrt{\frac{L_p}{\mu_x}}, \sqrt{\frac{L_q}{\mu_y}}, \frac{L_R}{\sqrt{\mu_x \mu_y}} \right\}\log \frac{1}{\varepsilon}\right)$ oracle calls for $\nabla R(x,y)$ to find an $\varepsilon$-solution of the problem. To the best of our knowledge, we are the first to develop an optimal algorithm with complexity separation in the case $\mu_x \neq \mu_y$. Also, we apply this algorithm to a bilinear saddle point problem and obtain the optimal complexity for this class of problems.
Submitted 24 July, 2023;
originally announced July 2023.
-
Decentralized Optimization Over Slowly Time-Varying Graphs: Algorithms and Lower Bounds
Authors:
Dmitry Metelev,
Aleksandr Beznosikov,
Alexander Rogozin,
Alexander Gasnikov,
Anton Proskurnikov
Abstract:
We consider a decentralized convex unconstrained optimization problem, where the cost function can be decomposed into a sum of strongly convex and smooth functions, associated with individual agents, interacting over a static or time-varying network. Our main concern is the convergence rate of first-order optimization algorithms as a function of the network's graph, more specifically, of the condition numbers of gossip matrices. We are interested in the case when the network is time-varying but the rate of changes is restricted. We study two cases: a randomly changing network satisfying the Markov property and a network changing in a deterministic manner. For the random case, we propose a decentralized optimization algorithm with accelerated consensus. For the deterministic scenario, we show that if the graph is changing in a worst-case way, accelerated consensus is not possible even if only two edges are changed at each iteration. The fact that such a low rate of network changes is sufficient to make accelerated consensus impossible is novel and improves the previous results in the literature.
Submitted 24 July, 2023;
originally announced July 2023.
-
Non-Smooth Setting of Stochastic Decentralized Convex Optimization Problem Over Time-Varying Graphs
Authors:
Aleksandr Lobanov,
Andrew Veprikov,
Georgiy Konin,
Aleksandr Beznosikov,
Alexander Gasnikov,
Dmitry Kovalev
Abstract:
Distributed optimization has a rich history and has demonstrated its effectiveness in many machine learning applications. In this paper we study a subclass of distributed optimization, namely decentralized optimization in a non-smooth setting. Decentralized means that $m$ agents (machines) working in parallel on one problem communicate only with neighboring agents (machines), i.e. there is no (central) server through which agents communicate. By non-smooth setting we mean that each agent has a convex stochastic non-smooth function, that is, agents can hold and communicate information only about the value of the objective function, which corresponds to a gradient-free oracle. In this paper, to minimize the global objective function, which consists of the sum of the functions of each agent, we create a gradient-free algorithm by applying a smoothing scheme via $l_2$ randomization. We also verify in experiments the obtained theoretical convergence results of the gradient-free algorithm proposed in this paper.
Submitted 5 September, 2023; v1 submitted 1 July, 2023;
originally announced July 2023.
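To make the phrase "smoothing scheme via $l_2$ randomization" in the abstract above more concrete, here is a minimal single-agent sketch of the standard two-point gradient estimator that samples a direction uniformly from the unit $l_2$ sphere. It is an assumption that the paper uses this particular two-point form; the function name `l2_two_point_grad`, the smoothing radius `tau`, and the quadratic test function are chosen purely for illustration, and the decentralized communication part is omitted entirely.

```python
import numpy as np

def l2_two_point_grad(f, x, tau, rng):
    """Two-point zero-order gradient estimate with l2-sphere randomization.

    Uses only two function values, f(x + tau*e) and f(x - tau*e), where e is
    drawn uniformly from the unit l2 sphere.
    """
    d = x.size
    e = rng.standard_normal(d)
    e /= np.linalg.norm(e)               # normalized Gaussian is uniform on the sphere
    return d * (f(x + tau * e) - f(x - tau * e)) / (2.0 * tau) * e

# Hypothetical check on a smooth quadratic, whose true gradient is x itself.
rng = np.random.default_rng(0)
quadratic = lambda v: 0.5 * float(v @ v)
point = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
estimate = np.mean(
    [l2_two_point_grad(quadratic, point, 1e-3, rng) for _ in range(20000)], axis=0
)
print(estimate)
```

Each individual estimate is noisy and points along the random direction `e`; only the average over many directions approaches the gradient of the smoothed function, which is why such methods trade gradient access for more function evaluations.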
-
First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities
Authors:
Aleksandr Beznosikov,
Sergey Samsonov,
Marina Sheshukova,
Alexander Gasnikov,
Alexey Naumov,
Eric Moulines
Abstract:
This paper delves into stochastic optimization problems that involve Markovian noise. We present a unified approach for the theoretical analysis of first-order gradient methods for stochastic optimization and variational inequalities. Our approach covers scenarios for both non-convex and strongly convex minimization problems. To achieve an optimal (linear) dependence on the mixing time of the underlying noise sequence, we use the randomized batching scheme, which is based on the multilevel Monte Carlo method. Moreover, our technique allows us to eliminate the limiting assumptions of previous research on Markov noise, such as the need for a bounded domain and uniformly bounded stochastic gradients. Our extension to variational inequalities under Markovian noise is original. Additionally, we provide lower bounds that match the oracle complexity of our method in the case of strongly convex optimization problems.
Submitted 30 March, 2024; v1 submitted 25 May, 2023;
originally announced May 2023.
-
Sarah Frank-Wolfe: Methods for Constrained Optimization with Best Rates and Practical Features
Authors:
Aleksandr Beznosikov,
David Dobre,
Gauthier Gidel
Abstract:
The Frank-Wolfe (FW) method is a popular approach for solving optimization problems with structured constraints that arise in machine learning applications. In recent years, stochastic versions of FW have gained popularity, motivated by large datasets for which the computation of the full gradient is prohibitively expensive. In this paper, we present two new variants of the FW algorithms for stochastic finite-sum minimization. Our algorithms have the best convergence guarantees of existing stochastic FW approaches for both convex and non-convex objective functions. Our methods do not have the issue of permanently collecting large batches, which is common to many stochastic projection-free approaches. Moreover, our second approach does not require either large batches or full deterministic gradients, which is a typical weakness of many techniques for finite-sum problems. The faster theoretical rates of our approaches are confirmed experimentally.
Submitted 15 September, 2024; v1 submitted 23 April, 2023;
originally announced April 2023.
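For readers unfamiliar with the projection-free structure referred to in the abstract above, a minimal sketch of the classical deterministic Frank-Wolfe iteration on the probability simplex is given below. It is not the SARAH-based stochastic variant proposed in the paper: the full gradient is used at every step, and the function name `frank_wolfe_simplex`, the step size $2/(k+2)$, and the toy objective are assumptions made for the example.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters):
    """Classical Frank-Wolfe over the probability simplex (no projections)."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(n_iters):
        g = grad(x)
        # Linear minimization oracle: over the simplex, the minimizer of <g, s>
        # is the vertex at the coordinate with the smallest gradient entry.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        gamma = 2.0 / (k + 2.0)              # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * s    # convex combination stays feasible
    return x

# Hypothetical toy problem: minimize 0.5 * ||x - b||^2 with b inside the simplex.
b = np.array([0.2, 0.5, 0.3])
x_hat = frank_wolfe_simplex(lambda x: x - b, np.array([1.0, 0.0, 0.0]), 2000)
print(x_hat)
```

The stochastic variants discussed in the abstract replace `grad(x)` with a cheaper estimator while keeping the same linear minimization and convex-combination steps.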
-
Similarity, Compression and Local Steps: Three Pillars of Efficient Communications for Distributed Variational Inequalities
Authors:
Aleksandr Beznosikov,
Martin Takáč,
Alexander Gasnikov
Abstract:
Variational inequalities are a broad and flexible class of problems that includes minimization, saddle point, and fixed point problems as special cases. Therefore, variational inequalities are used in various applications ranging from equilibrium search to adversarial learning. With the increasing size of data and models, today's instances demand parallel and distributed computing for real-world machine learning problems, most of which can be represented as variational inequalities. Meanwhile, most distributed approaches have a significant bottleneck - the cost of communications. The three main techniques to reduce the total number of communication rounds and the cost of one such round are the similarity of local functions, compression of transmitted information, and local updates. In this paper, we combine all these approaches. Such a triple synergy did not exist before for variational inequalities and saddle problems, nor even for minimization problems. The methods presented in this paper have the best theoretical guarantees of communication complexity and are significantly ahead of other methods for distributed variational inequalities. The theoretical results are confirmed by adversarial learning experiments on synthetic and real datasets.
Submitted 30 March, 2024; v1 submitted 15 February, 2023;
originally announced February 2023.
-
Randomized gradient-free methods in convex optimization
Authors:
Alexander Gasnikov,
Darina Dvinskikh,
Pavel Dvurechensky,
Eduard Gorbunov,
Aleksander Beznosikov,
Aleksandr Lobanov
Abstract:
This review presents modern gradient-free methods to solve convex optimization problems. By gradient-free methods, we mean those that use only (noisy) realizations of the objective value. We are motivated by various applications where gradient information is prohibitively expensive or even unavailable. We mainly focus on three criteria: oracle complexity, iteration complexity, and the maximum permissible noise level.
Submitted 12 February, 2024; v1 submitted 24 November, 2022;
originally announced November 2022.
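
As an illustration of what "using only realizations of the objective value" can look like in practice, here is a minimal sketch of a two-point random-direction gradient estimate. It is a generic textbook construction, not code from the review; the function name `two_point_grad_estimate`, the smoothing parameter `tau`, and the averaging over `num_dirs` directions are all assumptions made for this example.

```python
import numpy as np

def two_point_grad_estimate(f, x, tau=1e-4, num_dirs=1, rng=None):
    """Estimate the gradient of f at x from function values only.

    Hypothetical helper for illustration: averages num_dirs two-point
    finite differences taken along random unit directions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x, dtype=float)
    d = x.size
    g = np.zeros(d)
    for _ in range(num_dirs):
        e = rng.standard_normal(d)
        e /= np.linalg.norm(e)  # direction uniform on the unit sphere
        # Central difference along e, scaled by d so that the estimate
        # is unbiased for the gradient of a quadratic objective.
        g += d * (f(x + tau * e) - f(x - tau * e)) / (2.0 * tau) * e
    return g / num_dirs

# Usage: for f(x) = 0.5 * ||x||^2 the true gradient is x itself.
x0 = np.array([1.0, -2.0, 0.5])
g_hat = two_point_grad_estimate(lambda v: 0.5 * np.dot(v, v), x0, num_dirs=5000)
```

Each direction costs two function evaluations, so the number of directions trades oracle calls against the variance of the estimate.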
-
Decentralized convex optimization over time-varying graphs: a survey
Authors:
Alexander Rogozin,
Alexander Gasnikov,
Aleksander Beznosikov,
Dmitry Kovalev
Abstract:
Decentralized optimization over time-varying networks has a wide range of applications in distributed learning, signal processing and various distributed control problems. The agents of the distributed system locally hold optimization objectives and can communicate to their immediate neighbors over a network that changes from time to time. In this paper, we survey state-of-the-art results and describe the techniques for optimization over time-varying graphs. We also give an overview of open questions in the field and formulate hypotheses and directions for future work.
Submitted 17 April, 2023; v1 submitted 18 October, 2022;
originally announced October 2022.
-
SARAH-based Variance-reduced Algorithm for Stochastic Finite-sum Cocoercive Variational Inequalities
Authors:
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
Variational inequalities are a broad formalism that encompasses a vast number of applications. Motivated by applications in machine learning and beyond, stochastic methods are of great importance. In this paper we consider the problem of stochastic finite-sum cocoercive variational inequalities. For this class of problems, we investigate the convergence of the method based on the SARAH variance reduction technique. We show that for strongly monotone problems it is possible to achieve linear convergence to a solution using this method. Experiments confirm the importance and practical applicability of our approach.
Submitted 12 October, 2022;
originally announced October 2022.
-
Smooth Monotone Stochastic Variational Inequalities and Saddle Point Problems: A Survey
Authors:
Aleksandr Beznosikov,
Boris Polyak,
Eduard Gorbunov,
Dmitry Kovalev,
Alexander Gasnikov
Abstract:
This paper is a survey of methods for solving smooth (strongly) monotone stochastic variational inequalities. To begin with, we give the deterministic foundation from which the stochastic methods eventually evolved. Then we review methods for the general stochastic formulation, and look at the finite sum setup. The last parts of the paper are devoted to various recent (not necessarily stochastic) advances in algorithms for variational inequalities.
Submitted 2 April, 2023; v1 submitted 29 August, 2022;
originally announced August 2022.
-
Compression and Data Similarity: Combination of Two Techniques for Communication-Efficient Solving of Distributed Variational Inequalities
Authors:
Aleksandr Beznosikov,
Alexander Gasnikov
Abstract:
Variational inequalities are an important tool that covers minimization, saddle point, game, and fixed-point problems. Modern large-scale and computationally expensive practical applications make distributed methods for solving these problems popular. Meanwhile, most distributed systems have a basic problem - a communication bottleneck. There are various techniques to deal with it. In particular, in this paper we consider a combination of two popular approaches: compression and data similarity. We show that this synergy can be more effective than each of the approaches separately in solving distributed smooth strongly monotone variational inequalities. Experiments confirm the theoretical conclusions.
Submitted 3 September, 2022; v1 submitted 19 June, 2022;
originally announced June 2022.
-
On Scaled Methods for Saddle Point Problems
Authors:
Aleksandr Beznosikov,
Aibek Alanov,
Dmitry Kovalev,
Martin Takáč,
Alexander Gasnikov
Abstract:
Methods with adaptive scaling of different features play a key role in solving saddle point problems, primarily due to Adam's popularity for solving adversarial machine learning problems, including GAN training. This paper carries out a theoretical analysis of the following scaling techniques for solving SPPs: the well-known Adam and RmsProp scaling and the newer AdaHessian and OASIS based on Hutchinson's approximation. We use the Extra Gradient and its improved version with negative momentum as the basic method. Experimental studies on GANs show good applicability not only for Adam, but also for other less popular methods.
Submitted 21 June, 2023; v1 submitted 16 June, 2022;
originally announced June 2022.
-
Stochastic Gradient Methods with Preconditioned Updates
Authors:
Abdurakhmon Sadiev,
Aleksandr Beznosikov,
Abdulla Jasem Almansoori,
Dmitry Kamzolov,
Rachael Tappenden,
Martin Takáč
Abstract:
This work considers the non-convex finite sum minimization problem. There are several algorithms for such problems, but existing methods often work poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. Thus, here we include a preconditioner based on Hutchinson's approach to approximating the diagonal of the Hessian, and couple it with several gradient-based methods to give new scaled algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented. We prove linear convergence when both smoothness and the PL condition are assumed. Our adaptively scaled methods use approximate partial second-order curvature information and, therefore, can better mitigate the impact of badly scaled problems. This improved practical performance is demonstrated in the numerical experiments also presented in this work.
Submitted 14 January, 2024; v1 submitted 1 June, 2022;
originally announced June 2022.
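
The abstract above relies on Hutchinson's approach to approximating the diagonal of the Hessian. A minimal sketch of that generic estimator follows; it is not the paper's Scaled SARAH or Scaled L-SVRG code. The helper name `hutchinson_diag` and the explicit matrix used to build the Hessian-vector product `hvp` are assumptions for illustration only (in practice the product would typically come from automatic differentiation).

```python
import numpy as np

def hutchinson_diag(hvp, dim, num_samples=100, rng=None):
    """Estimate diag(H) using only Hessian-vector products z -> H @ z.

    Hypothetical helper for illustration: averages z * (H z) over
    Rademacher vectors z, whose expectation equals diag(H).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    est = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher vector
        est += z * hvp(z)                      # elementwise product
    return est / num_samples

# Usage with an explicit symmetric matrix standing in for the Hessian.
H = np.array([[2.0, 0.5],
              [0.5, 3.0]])
d_hat = hutchinson_diag(lambda v: H @ v, dim=2, num_samples=4000)
```

The off-diagonal entries contribute zero-mean noise, so the estimate tightens as `num_samples` grows and is exact for a diagonal matrix.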
-
Optimal Gradient Sliding and its Application to Distributed Optimization Under Similarity
Authors:
Dmitry Kovalev,
Aleksandr Beznosikov,
Ekaterina Borodich,
Alexander Gasnikov,
Gesualdo Scutari
Abstract:
We study structured convex optimization problems, with additive objective $r:=p + q$, where $r$ is ($μ$-strongly) convex, $q$ is $L_q$-smooth and convex, and $p$ is $L_p$-smooth, possibly nonconvex. For such a class of problems, we propose an inexact accelerated gradient sliding method that can skip the gradient computation for one of these components while still achieving optimal complexity of gradient calls of $p$ and $q$, that is, $\mathcal{O}(\sqrt{L_p/μ})$ and $\mathcal{O}(\sqrt{L_q/μ})$, respectively. This result is much sharper than the classic black-box complexity $\mathcal{O}(\sqrt{(L_p+L_q)/μ})$, especially when the difference between $L_p$ and $L_q$ is large. We then apply the proposed method to solve distributed optimization problems over master-worker architectures, under agents' function similarity, due to statistical data similarity or otherwise. The distributed algorithm achieves for the first time lower complexity bounds on {\it both} communication and local gradient calls, with the former having been a long-standing open problem. Finally the method is extended to distributed saddle-problems (under function similarity) by means of solving a class of variational inequalities, achieving lower communication and computation complexity bounds.
Submitted 30 May, 2022;
originally announced May 2022.
-
Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods
Authors:
Aleksandr Beznosikov,
Eduard Gorbunov,
Hugo Berard,
Nicolas Loizou
Abstract:
Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. The success of the method led to several advanced extensions of the classical SGDA, including variants with arbitrary sampling, variance reduction, coordinate randomization, and distributed variants with compression, which were extensively studied in the literature, especially during the last few years. In this paper, we propose a unified convergence analysis that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications and have been developed separately in various communities. A key to our unified framework is a parametric assumption on the stochastic estimates. Via our general theoretical framework, we either recover the sharpest known rates for the known special cases or tighten them. Moreover, to illustrate the flexibility of our approach we develop several new variants of SGDA such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA). Although variants of the new methods are known for solving minimization problems, they were never considered or analyzed for solving min-max problems and VIPs. We also demonstrate the most important properties of the new methods through extensive numerical experiments.
Submitted 8 March, 2023; v1 submitted 15 February, 2022;
originally announced February 2022.
-
Optimal Algorithms for Decentralized Stochastic Variational Inequalities
Authors:
Dmitry Kovalev,
Aleksandr Beznosikov,
Abdurakhmon Sadiev,
Michael Persiianov,
Peter Richtárik,
Alexander Gasnikov
Abstract:
Variational inequalities are a formalism that includes games, minimization, saddle point, and equilibrium problems as special cases. Methods for variational inequalities are therefore universal approaches for many applied tasks, including machine learning problems. This work concentrates on the decentralized setting, which is increasingly important but not well understood. In particular, we consider decentralized stochastic (sum-type) variational inequalities over fixed and time-varying networks. We present lower complexity bounds for both communication and local iterations and construct optimal algorithms that match these lower bounds. Our algorithms are the best in the available literature not only in the decentralized stochastic case, but also in the decentralized deterministic and non-distributed stochastic cases. Experimental results confirm the effectiveness of the presented algorithms.
Submitted 2 April, 2023; v1 submitted 6 February, 2022;
originally announced February 2022.
-
The Power of First-Order Smooth Optimization for Black-Box Non-Smooth Problems
Authors:
Alexander Gasnikov,
Anton Novitskii,
Vasilii Novitskii,
Farshed Abdukhakimov,
Dmitry Kamzolov,
Aleksandr Beznosikov,
Martin Takáč,
Pavel Dvurechensky,
Bin Gu
Abstract:
Gradient-free/zeroth-order methods for black-box convex optimization have been extensively studied in the last decade, with the main focus on oracle call complexity. In this paper, besides the oracle complexity, we also focus on iteration complexity, and propose a generic approach that, based on optimal first-order methods, yields in a black-box fashion new zeroth-order algorithms for non-smooth convex optimization problems. Our approach not only leads to optimal oracle complexity, but also attains iteration complexity similar to that of first-order methods, which in turn makes it possible to exploit parallel computations to accelerate the convergence of our algorithms. We also elaborate on extensions for stochastic optimization problems, saddle-point problems, and distributed optimization.
Submitted 1 March, 2023; v1 submitted 28 January, 2022;
originally announced January 2022.
-
A Unified Analysis of Variational Inequality Methods: Variance Reduction, Sampling, Quantization and Coordinate Descent
Authors:
Aleksandr Beznosikov,
Alexander Gasnikov,
Karina Zainulina,
Alexander Maslovskiy,
Dmitry Pasechnyuk
Abstract:
In this paper, we present a unified analysis of methods for the wide class of variational inequalities, which includes minimization problems and saddle point problems. We develop our analysis on the modified Extra-Gradient method (the classic algorithm for variational inequalities) and consider the strongly monotone and monotone cases, which correspond to strongly-convex-strongly-concave and convex-concave saddle point problems. The theoretical analysis is based on parametric assumptions about Extra-Gradient iterations. Therefore, it can serve as a strong basis for combining existing methods of this type and also for creating new algorithms. In particular, to show this we develop new robust methods, which include methods with quantization, coordinate methods, distributed randomized local methods, and others. Most of these approaches have never been considered in the generality of variational inequalities and have previously been used only for minimization problems. The robustness of the new methods is also confirmed by numerical experiments with GANs.
Submitted 3 February, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
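
For readers unfamiliar with the Extra-Gradient method that the abstract above builds on, here is a minimal sketch of the classical deterministic iteration on a toy bilinear saddle point problem $\min_x \max_y xy$. It is not the paper's modified method; the names `operator` and `extragradient`, the step size, and the toy problem are assumptions chosen for illustration.

```python
import numpy as np

def operator(z):
    """Operator F(x, y) = (df/dx, -df/dy) for the toy problem f(x, y) = x * y."""
    x, y = z
    return np.array([y, -x])

def extragradient(F, z0, step=0.5, num_iters=200):
    """Classical Extra-Gradient: an extrapolation step, then the update step."""
    z = np.asarray(z0, dtype=float)
    for _ in range(num_iters):
        z_half = z - step * F(z)   # look-ahead (extrapolation) point
        z = z - step * F(z_half)   # update using the operator at the look-ahead
    return z

# Usage: the unique solution of the toy problem is (0, 0).
z_final = extragradient(operator, [1.0, 1.0])
```

On this bilinear problem, a plain descent-ascent step `z - step * F(z)` moves away from the solution, while evaluating the operator at the look-ahead point makes the iteration contract toward it for step sizes below 1.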
-
Random-reshuffled SARAH does not need a full gradient computations
Authors:
Aleksandr Beznosikov,
Martin Takáč
Abstract:
The StochAstic Recursive grAdient algoritHm (SARAH) is a variance-reduced variant of the Stochastic Gradient Descent (SGD) algorithm that needs a full gradient of the objective function from time to time. In this paper, we remove the necessity of a full gradient computation. This is achieved by using a randomized reshuffling strategy and aggregating stochastic gradients obtained in each epoch. The aggregated stochastic gradients serve as an estimate of a full gradient in the SARAH algorithm. We provide a theoretical analysis of the proposed approach and conclude the paper with numerical experiments that demonstrate the efficiency of this approach.
Submitted 14 January, 2024; v1 submitted 26 November, 2021;
originally announced November 2021.
-
Distributed Saddle-Point Problems Under Similarity
Authors:
Aleksandr Beznosikov,
Gesualdo Scutari,
Alexander Rogozin,
Alexander Gasnikov
Abstract:
We study solution methods for (strongly-)convex-(strongly-)concave Saddle-Point Problems (SPPs) over networks of two types - master/workers (thus centralized) architectures and meshed (thus decentralized) networks. The local functions at each node are assumed to be similar, due to statistical data similarity or otherwise. We establish lower complexity bounds for a fairly general class of algorithms solving the SPP. We show that a given suboptimality $\varepsilon>0$ is achieved over master/workers networks in $Ω\big(Δ\cdot δ/μ\cdot \log (1/\varepsilon)\big)$ rounds of communications, where $δ>0$ measures the degree of similarity of the local functions, $μ$ is their strong convexity constant, and $Δ$ is the diameter of the network. The lower communication complexity bound over meshed networks reads $Ω\big(1/\sqrt{ρ} \cdot δ/μ\cdot\log (1/\varepsilon)\big)$, where $ρ$ is the (normalized) eigengap of the gossip matrix used for the communication between neighbouring nodes. We then propose algorithms matching the lower bounds over either type of network (up to log-factors). We assess the effectiveness of the proposed algorithms on a robust logistic regression problem.
Submitted 22 August, 2022; v1 submitted 22 July, 2021;
originally announced July 2021.