Showing 1–9 of 9 results for author: Velikanov, M

  1. arXiv:2410.05355  [pdf, other]

    cs.CL cs.AI

    Falcon Mamba: The First Competitive Attention-free 7B Language Model

    Authors: Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, Hakim Hacid

    Abstract: In this technical report, we present Falcon Mamba 7B, a new base large language model based on the novel Mamba architecture. Falcon Mamba 7B is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, Falcon Mamba 7B surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3.1 8B, and Falcon2 11B. It is on par with Gemma 7B and…

    Submitted 7 October, 2024; originally announced October 2024.

  2. arXiv:2410.04228  [pdf, other]

    cs.LG math.OC

    SGD with memory: fundamental properties and stochastic acceleration

    Authors: Dmitry Yarotsky, Maksim Velikanov

    Abstract: An important open problem is the theoretically feasible acceleration of mini-batch SGD-type algorithms on quadratic problems with power-law spectrum. In the non-stochastic setting, the optimal exponent $\xi$ in the loss convergence $L_t \sim C_L t^{-\xi}$ is double that in plain GD and is achievable using Heavy Ball (HB) with a suitable schedule; this no longer works in the presence of mini-batch noise.…

    Submitted 10 March, 2025; v1 submitted 5 October, 2024; originally announced October 2024.

    Comments: ICLR 2025 camera-ready

  3. arXiv:2407.14885  [pdf, other]

    cs.CL cs.CV

    Falcon2-11B Technical Report

    Authors: Quentin Malartic, Nilabhra Roy Chowdhury, Ruxandra Cojocaru, Mugariya Farooq, Giulia Campesan, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Maksim Velikanov, Basma El Amel Boussaha, Mohammed Al-Yafeai, Hamza Alobeidli, Leen Al Qadi, Mohamed El Amine Seddik, Kirill Fedyanin, Reda Alami, Hakim Hacid

    Abstract: We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. We report our findings during the training of the Falcon2-11B which follows a multi-stage approach where the early stages are distinguished by their context length and a final stage where we use a curated, high-quality dataset. Additio…

    Submitted 20 July, 2024; originally announced July 2024.

  4. arXiv:2403.11696  [pdf, other]

    cs.LG stat.ML

    Generalization error of spectral algorithms

    Authors: Maksim Velikanov, Maxim Panov, Dmitry Yarotsky

    Abstract: The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of ke…

    Submitted 18 March, 2024; originally announced March 2024.

  5. arXiv:2312.15799  [pdf, other]

    stat.ML cs.LG

    Efficient Conformal Prediction under Data Heterogeneity

    Authors: Vincent Plassier, Nikita Kotelevskii, Aleksandr Rubashevskii, Fedor Noskov, Maksim Velikanov, Alexander Fishkov, Samuel Horvath, Martin Takac, Eric Moulines, Maxim Panov

    Abstract: Conformal Prediction (CP) stands out as a robust framework for uncertainty quantification, which is crucial for ensuring the reliability of predictions. However, common CP methods heavily rely on data exchangeability, a condition often violated in practice. Existing approaches for tackling non-exchangeability lead to methods that are not computable beyond the simplest examples. This work introduce…

    Submitted 13 July, 2024; v1 submitted 25 December, 2023; originally announced December 2023.

    Comments: 29 pages

  6. arXiv:2206.11124  [pdf, other]

    cs.LG math.OC stat.ML

    A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta

    Authors: Maksim Velikanov, Denis Kuznedelev, Dmitry Yarotsky

    Abstract: Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze noise-averaged properties of mini-batch SGD for linear models at constant learning rates, momenta and sizes of batches. Our key idea is to consider the dynamics of the second moments of model parameters for a special family of "Spectrally Expres…

    Submitted 9 March, 2023; v1 submitted 22 June, 2022; originally announced June 2022.

    Comments: Revised version accepted at ICLR 2023

  7. arXiv:2202.12297  [pdf, other]

    stat.ML cs.LG

    Embedded Ensembles: Infinite Width Limit and Operating Regimes

    Authors: Maksim Velikanov, Roman Kail, Ivan Anokhin, Roman Vashurin, Maxim Panov, Alexey Zaytsev, Dmitry Yarotsky

    Abstract: A memory efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper we perform a systematic theoretical and empirical analysis of embedded ensembles with different number…

    Submitted 24 February, 2022; originally announced February 2022.

  8. arXiv:2202.00992  [pdf, other]

    math.OC cs.LG cs.NE

    Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions

    Authors: Maksim Velikanov, Dmitry Yarotsky

    Abstract: Performance of optimization on quadratic problems sensitively depends on the low-lying part of the spectrum. For large (effectively infinite-dimensional) problems, this part of the spectrum can often be naturally represented or approximated by power law distributions, resulting in power law convergence rates for iterative solutions of these problems by gradient-based algorithms. In this paper, we…

    Submitted 25 March, 2024; v1 submitted 2 February, 2022; originally announced February 2022.

  9. arXiv:2105.00507  [pdf, other]

    cs.LG cs.NE math.OC stat.ML

    Universal scaling laws in the gradient descent training of neural networks

    Authors: Maksim Velikanov, Dmitry Yarotsky

    Abstract: Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic exp…

    Submitted 2 May, 2021; originally announced May 2021.