Showing 1–14 of 14 results for author: Safaryan, M

Searching in archive cs.
  1. arXiv:2410.16103  [pdf, other]

    cs.LG math.OC stat.ML

    LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

    Authors: Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh

    Abstract: We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows…

    Submitted 2 March, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: 39 pages, ICLR 2025

  2. arXiv:2408.17163  [pdf, other]

    cs.LG

    The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information

    Authors: Diyuan Wu, Ionut-Vlad Modoranu, Mher Safaryan, Denis Kuznedelev, Dan Alistarh

    Abstract: The rising footprint of machine learning has led to a focus on imposing model sparsity as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-vs-sparsity is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework \citep{lecun90brain, hassibi1992second, hassibi1993optimal}, which leverages loss curv…

    Submitted 30 August, 2024; originally announced August 2024.

  3. arXiv:2405.15593  [pdf, other]

    cs.LG math.NA

    MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence

    Authors: Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtárik, Dan Alistarh

    Abstract: We propose a new variant of the Adam optimizer called MicroAdam that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instance of the classical …

    Submitted 5 November, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

  4. arXiv:2310.20452  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms

    Authors: Rustem Islamov, Mher Safaryan, Dan Alistarh

    Abstract: We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting, where each worker has its own computation and communication speeds, as well as data distribution. In these algorithms, workers compute possibly stale and stochastic gradients associated with their local data at some iteration back in history and then return those gradients to the server without synchronizing…

    Submitted 31 October, 2023; originally announced October 2023.

  5. arXiv:2305.17581  [pdf, other]

    cs.LG math.OC

    Knowledge Distillation Performs Partial Variance Reduction

    Authors: Mher Safaryan, Alexandra Peste, Dan Alistarh

    Abstract: Knowledge distillation is a popular approach for enhancing the performance of "student" models, with lower representational capacity, by taking advantage of more powerful "teacher" models. Despite its apparent simplicity and widespread use, the underlying mechanics behind knowledge distillation (KD) are still not fully understood. In this work, we shed new light on the inner workings of this m…

    Submitted 8 December, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: 15+22 pages, NeurIPS 2023

  6. arXiv:2210.16402  [pdf, other]

    cs.LG math.OC

    GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity

    Authors: Artavazd Maranjyan, Mher Safaryan, Peter Richtárik

    Abstract: We study a class of distributed optimization algorithms that aim to alleviate high communication costs by allowing the clients to perform multiple local gradient-type training steps prior to communication. While methods of this type have been studied for about a decade, the empirically observed acceleration properties of local training eluded all attempts at theoretical understanding. In a recent…

    Submitted 30 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: 26 pages, 3 algorithms, 3 figures

  7. arXiv:2206.03588  [pdf, other]

    cs.LG math.OC

    Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation

    Authors: Rustem Islamov, Xun Qian, Slavomír Hanzely, Mher Safaryan, Peter Richtárik

    Abstract: Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems. In this work, we study communication compression and aggregation mechanisms for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We pr…

    Submitted 7 June, 2022; originally announced June 2022.

  8. arXiv:2111.01847  [pdf, other]

    cs.LG math.OC

    Basis Matters: Better Communication-Efficient Second Order Methods for Federated Learning

    Authors: Xun Qian, Rustem Islamov, Mher Safaryan, Peter Richtárik

    Abstract: Recent advances in distributed optimization have shown that Newton-type methods with proper communication compression mechanisms can guarantee fast local rates and low communication cost compared to first order methods. We discover that the communication cost of these methods can be further reduced, sometimes dramatically so, with a surprisingly simple trick: Basis Learn (BL). The idea is to…

    Submitted 2 November, 2021; originally announced November 2021.

    Comments: 52 pages

  9. arXiv:2106.03524  [pdf, other]

    cs.LG math.OC

    Theoretically Better and Numerically Faster Distributed Optimization with Smoothness-Aware Quantization Techniques

    Authors: Bokun Wang, Mher Safaryan, Peter Richtárik

    Abstract: To address the high communication costs of distributed machine learning, a large body of work has been devoted in recent years to designing various compression strategies, such as sparsification and quantization, and optimization algorithms capable of using them. Recently, Safaryan et al. (2021) pioneered a dramatically different compression design approach: they first use the local training data…

    Submitted 12 October, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: To appear in NeurIPS 2022

  10. arXiv:2106.02969  [pdf, other]

    cs.LG cs.DC math.OC

    FedNL: Making Newton-Type Methods Applicable to Federated Learning

    Authors: Mher Safaryan, Rustem Islamov, Xun Qian, Peter Richtárik

    Abstract: Inspired by recent work of Islamov et al. (2021), we propose a family of Federated Newton Learn (FedNL) methods, which we believe is a marked step in the direction of making second-order methods applicable to FL. In contrast to the aforementioned work, FedNL employs a different Hessian learning technique which i) enhances privacy as it does not rely on the training data to be revealed to the coordi…

    Submitted 22 May, 2022; v1 submitted 5 June, 2021; originally announced June 2021.

    Comments: 65 pages, 7 algorithms, 14 figures; accepted to ICML 2022

  11. arXiv:2102.07245  [pdf, other]

    cs.LG math.OC

    Smoothness Matrices Beat Smoothness Constants: Better Communication Compression Techniques for Distributed Optimization

    Authors: Mher Safaryan, Filip Hanzely, Peter Richtárik

    Abstract: Large scale distributed optimization has become the default tool for the training of supervised machine learning models with a large number of parameters and training data. Recent advancements in the field provide several mechanisms for speeding up the training, including compressed communication, variance reduction and acceleration. However, none of these methods is capable of e…

    Submitted 14 February, 2021; originally announced February 2021.

    Comments: 59 pages, 5 figures, 6 tables

  12. arXiv:2010.03246  [pdf, other]

    cs.LG math.OC

    Optimal Gradient Compression for Distributed and Federated Learning

    Authors: Alyazeed Albasyoni, Mher Safaryan, Laurent Condat, Peter Richtárik

    Abstract: Communicating information, like gradient vectors, between computing nodes in distributed and federated learning is typically an unavoidable burden, resulting in scalability issues. Indeed, communication might be slow and costly. Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques, in the form of sparsification, quantization, o…

    Submitted 7 October, 2020; originally announced October 2020.

  13. arXiv:2002.12410  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    On Biased Compression for Distributed Learning

    Authors: Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, Mher Safaryan

    Abstract: In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study…

    Submitted 14 January, 2024; v1 submitted 27 February, 2020; originally announced February 2020.

    Comments: 50 pages, 9 figures, 5 tables, 22 theorems and lemmas, 7 new compression operators, 1 algorithm

    Journal ref: Journal of Machine Learning Research 2023: https://www.jmlr.org/papers/v24/21-1548.html

  14. arXiv:2002.08958  [pdf, other]

    cs.LG cs.DC cs.IT math.OC stat.ML

    Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor

    Authors: Mher Safaryan, Egor Shulgin, Peter Richtárik

    Abstract: In order to mitigate the high communication cost in distributed and federated learning, various vector compression schemes, such as quantization, sparsification and dithering, have become very popular. In designing a compression method, one aims to communicate as few bits as possible, which minimizes the cost per communication round, while at the same time attempting to impart as little distortion…

    Submitted 26 January, 2021; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: 23 pages, 6 figures, 2 tables

    Journal ref: Information and Inference: A Journal of the IMA, 2021