Showing 1–6 of 6 results for author: Ilin, I

Searching in archive cs.
  1. arXiv:2504.05346  [pdf, other]

    cs.LG cs.AI cs.CL cs.PF

    Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

    Authors: Ivan Ilin, Peter Richtarik

    Abstract: This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured for…

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: 8 pages, 3 Figures, 3 Tables, 2 Algorithms, paper comes with Appendix

    MSC Class: 68T07; 68Q32

  2. arXiv:2504.04520  [pdf, other]

    cs.LG cs.AI cs.CL

    Hessian of Perplexity for Large Language Models by PyTorch autograd (Open Source)

    Authors: Ivan Ilin

    Abstract: Computing the full Hessian matrix (the matrix of second-order derivatives) for an entire Large Language Model (LLM) is infeasible due to its sheer size. In this technical report, we aim to provide a comprehensive guide on how to accurately compute at least a small portion of the Hessian for LLMs using the PyTorch autograd library. We also demonstrate how to compute the full diagonal of the Hessian ma…

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: 15 pages, 3 figures, open source code on GitHub

    MSC Class: 68T07; 65K10; 65Y05

  3. arXiv:2411.17525  [pdf, other]

    cs.LG

    Pushing the Limits of Large Language Model Quantization via the Linearity Theorem

    Authors: Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, Dan Alistarh

    Abstract: Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we pre…

    Submitted 26 November, 2024; originally announced November 2024.

  4. arXiv:2405.14852  [pdf, other]

    cs.LG

    PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

    Authors: Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik

    Abstract: There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accurac…

    Submitted 30 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: Preprint

  5. arXiv:2402.04785  [pdf, other]

    math.OC cs.LG

    Shadowheart SGD: Distributed Asynchronous SGD with Optimal Time Complexity Under Arbitrary Computation and Communication Heterogeneity

    Authors: Alexander Tyurin, Marta Pozzi, Ivan Ilin, Peter Richtárik

    Abstract: We consider nonconvex stochastic optimization problems in the asynchronous centralized distributed setup where the communication times from workers to a server cannot be ignored, and the computation and communication times are potentially different for all workers. Using an unbiased compression technique, we develop a new method, Shadowheart SGD, that provably improves the time complexities of all…

    Submitted 2 November, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

  6. arXiv:2312.08053  [pdf, other]

    cs.LG cs.DC cs.IT

    Kimad: Adaptive Gradient Compression with Bandwidth Awareness

    Authors: Jihao Xin, Ivan Ilin, Shunkang Zhang, Marco Canini, Peter Richtárik

    Abstract: In distributed training, communication often emerges as a bottleneck. In response, we introduce Kimad, a solution that offers adaptive gradient compression. By consistently monitoring bandwidth, Kimad refines compression ratios to match specific neural network layer requirements. Our exhaustive tests and proofs confirm Kimad's outstanding performance, establishing it as a benchmark in adaptive com…

    Submitted 13 December, 2023; originally announced December 2023.