-
Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization
Authors:
Daniil Medyakov,
Sergey Stanko,
Gleb Molodtsov,
Philip Zmushko,
Grigoriy Evseev,
Egor Petrov,
Aleksandr Beznosikov
Abstract:
Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gra…
▽ More
Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in the distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from the theoretical standpoint. Indeed, it depends on the parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness
Authors:
Nikita Kornilov,
Philip Zmushko,
Andrei Semenov,
Mark Ikonnikov,
Alexander Gasnikov,
Alexander Beznosikov
Abstract:
In recent years, non-convex optimization problems are more often described by generalized $(L_0, L_1)$-smoothness assumption rather than standard one. Meanwhile, severely corrupted data used in these problems has increased the demand for methods capable of handling heavy-tailed noises, i.e., noises with bounded $κ$-th moment. Motivated by these real-world trends and challenges, we explore sign-bas…
▽ More
In recent years, non-convex optimization problems are more often described by generalized $(L_0, L_1)$-smoothness assumption rather than standard one. Meanwhile, severely corrupted data used in these problems has increased the demand for methods capable of handling heavy-tailed noises, i.e., noises with bounded $κ$-th moment. Motivated by these real-world trends and challenges, we explore sign-based methods in this setup and demonstrate their effectiveness in comparison with other popular solutions like clipping or normalization.
In theory, we prove the first-known high probability convergence bounds under $(L_0, L_1)$-smoothness and heavy-tailed noises with mild parameter dependencies. In the case of standard smoothness, these bounds are novel for sign-based methods as well. In particular, SignSGD with batching achieves sample complexity $\tilde{O}\left(\left(\frac{ΔL_0d}{\varepsilon^2} + \frac{ΔL_1d^\frac{3}{2}}{\varepsilon}\right)\left[1 + \left(\fracσ{\varepsilon}\right)^\fracκ{κ-1}\right]\right), κ\in (1,2]$. Under the assumption of symmetric noises, SignSGD with Majority Voting can robustly work on the whole range of $κ\in (0,2]$ with complexity $\tilde{O}\left(\left(\frac{ΔL_0d}{\varepsilon^2} + \frac{ΔL_1d^\frac{3}{2}}{\varepsilon}\right)\left[\frac{1}{κ^2} + \frac{σ^2}{\varepsilon^2}\right]\right)$. We also obtain results for parameter-agnostic setups, Polyak-Lojasiewicz functions and momentum-based methods (in expectation). Our theoretical findings are supported by the superior performance of sign-based methods in training Large Language Models compared to clipping and normalization.
△ Less
Submitted 27 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
Label Privacy in Split Learning for Large Models with Parameter-Efficient Training
Authors:
Philip Zmushko,
Marat Mansurov,
Ruslan Svirschevski,
Denis Kuznedelev,
Max Ryabinin,
Aleksandr Beznosikov
Abstract:
As deep learning models become larger and more expensive, many practitioners turn to fine-tuning APIs. These web services allow fine-tuning a model between two parties: the client that provides the data, and the server that hosts the model. While convenient, these APIs raise a new concern: the data of the client is at risk of privacy breach during the training procedure. This challenge presents an…
▽ More
As deep learning models become larger and more expensive, many practitioners turn to fine-tuning APIs. These web services allow fine-tuning a model between two parties: the client that provides the data, and the server that hosts the model. While convenient, these APIs raise a new concern: the data of the client is at risk of privacy breach during the training procedure. This challenge presents an important practical case of vertical federated learning, where the two parties perform parameter-efficient fine-tuning (PEFT) of a large model. In this study, we systematically search for a way to fine-tune models over an API while keeping the labels private. We analyze the privacy of LoRA, a popular approach for parameter-efficient fine-tuning when training over an API. Using this analysis, we propose P$^3$EFT, a multi-party split learning algorithm that takes advantage of existing PEFT properties to maintain privacy at a lower performance overhead. To validate our algorithm, we fine-tune DeBERTa-v2-XXLarge, Flan-T5 Large and LLaMA-2 7B using LoRA adapters on a range of NLP tasks. We find that P$^3$EFT is competitive with existing privacy-preserving methods in multi-party and two-party setups while having higher accuracy.
△ Less
Submitted 21 December, 2024;
originally announced December 2024.
-
Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning
Authors:
Andrei Semenov,
Philip Zmushko,
Alexander Pichugin,
Aleksandr Beznosikov
Abstract:
Vertical Federated Learning (VFL) aims to enable collaborative training of deep learning models while maintaining privacy protection. However, the VFL procedure still has components that are vulnerable to attacks by malicious parties. In our work, we consider feature reconstruction attacks, a common risk targeting input data compromise. We theoretically claim that feature reconstruction attacks ca…
▽ More
Vertical Federated Learning (VFL) aims to enable collaborative training of deep learning models while maintaining privacy protection. However, the VFL procedure still has components that are vulnerable to attacks by malicious parties. In our work, we consider feature reconstruction attacks, a common risk targeting input data compromise. We theoretically claim that feature reconstruction attacks cannot succeed without knowledge of the prior distribution on data. Consequently, we demonstrate that even simple model architecture transformations can significantly impact the protection of input data during VFL. Confirming these findings with experimental results, we show that MLP-based models are resistant to state-of-the-art feature reconstruction attacks.
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training
Authors:
Philip Zmushko,
Aleksandr Beznosikov,
Martin Takáč,
Samuel Horváth
Abstract:
With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al.,…
▽ More
With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all these algorithms, the $\textit{effective rank of the weight updates remains low-rank}$, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce $\texttt{FRUGAL}$ ($\textbf{F}$ull-$\textbf{R}$ank $\textbf{U}$pdates with $\textbf{G}$r$\textbf{A}$dient sp$\textbf{L}$itting), a new memory-efficient optimization framework. $\texttt{FRUGAL}$ leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD (Bernstein et al., 2018). Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
△ Less
Submitted 12 November, 2024;
originally announced November 2024.