Showing 1–20 of 20 results for author: Belkin, M

Searching in archive math.
  1. arXiv:2411.11242  [pdf, other]

    cs.LG math.OC stat.ML

    Mirror Descent on Reproducing Kernel Banach Spaces

    Authors: Akash Kumar, Mikhail Belkin, Parthe Pandit

    Abstract: Recent advances in machine learning have led to increased interest in reproducing kernel Banach spaces (RKBS) as a more general framework that extends beyond reproducing kernel Hilbert spaces (RKHS). These works have resulted in the formulation of representer theorems under several regularized learning schemes. However, little is known about an optimization method that encompasses these results in…

    Submitted 17 November, 2024; originally announced November 2024.

    Comments: 42 pages, 3 figures

  2. arXiv:2410.07622  [pdf, other]

    math.CO math.RA

    Eigenvectors of the De Bruijn Graph Laplacian: A Natural Basis for the Cut and Cycle Space

    Authors: Anthony Philippakis, Neil Mallinar, Parthe Pandit, Mikhail Belkin

    Abstract: We study the Laplacian of the undirected De Bruijn graph over an alphabet $A$ of order $k$. While the eigenvalues of this Laplacian were found in 1998 by Delorme and Tillich [1], an explicit description of its eigenvectors has remained elusive. In this work, we find these eigenvectors in closed form and show that they yield a natural and canonical basis for the cut- and cycle-spaces of De Bruijn g…

    Submitted 10 October, 2024; originally announced October 2024.

  3. arXiv:2306.04815  [pdf, other]

    cs.LG math.OC stat.ML

    Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

    Authors: Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

    Abstract: In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that thes…

    Submitted 5 June, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: ICML 2024

  4. arXiv:2306.02601  [pdf, other]

    cs.LG math.OC stat.ML

    Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

    Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An Ma

    Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method,…

    Submitted 5 June, 2023; originally announced June 2023.

  5. arXiv:2209.15106  [pdf, other]

    cs.LG math.OC

    Restricted Strong Convexity of Deep Learning Models with Smooth Activations

    Authors: Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin

    Abstract: We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the ``near initialization'' perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, $m$ width, and $σ_0^2$ initialization variance. First, for suitable…

    Submitted 29 September, 2022; originally announced September 2022.

  6. arXiv:2205.11787  [pdf, other]

    cs.LG math.OC stat.ML

    Quadratic models for understanding catapult dynamics of neural networks

    Authors: Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

    Abstract: While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. In this work we show that recently proposed Neural Quadratic Models can exhibit the "catapult phase" [Lewkowycz et al. 2020] that arises when training such models with large learning rates. We then empirically show that the behaviour o…

    Submitted 1 May, 2024; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: accepted in ICLR 2024; changed the title

  7. arXiv:2205.11786  [pdf, other]

    cs.LG math.OC stat.ML

    Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture

    Authors: Libin Zhu, Chaoyue Liu, Mikhail Belkin

    Abstract: In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum in-degree of their neurons, except for the input and first layers. Our results identify the mathematical structure underlying transition to linearity and ge…

    Submitted 7 June, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: NeurIPS 2022

  8. arXiv:2202.06526  [pdf, other]

    cs.LG math.OC stat.ML

    Benign Overfitting in Two-layer Convolutional Neural Networks

    Authors: Yuan Cao, Zixiang Chen, Mikhail Belkin, Quanquan Gu

    Abstract: Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as "benign overfitting". Recently, there emerges a line of works studying "benign overfitting" from the theoretical perspective. However, they are limited to linear models or kernel/random feature models, and there i…

    Submitted 14 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: 42 pages, 1 figure. Version 3 improves the presentation and adds a comparison with a concurrent work

  9. arXiv:2112.14872  [pdf, other]

    math.OC cs.LG

    Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size

    Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

    Abstract: Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice. With the increasing popularity of deep learning over the past decade, stochastic gradient descent and its adaptive variants (e.g. Adagrad, Adam, etc.) have become prominent methods of choice for machine learning practitioners. While a large number of works have demonstrated that these fi…

    Submitted 29 December, 2021; originally announced December 2021.

    Comments: ICML 2021 Workshop on Beyond first-order methods in ML systems

  10. arXiv:2105.14368  [pdf, other]

    stat.ML cs.LG math.ST

    Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

    Authors: Mikhail Belkin

    Abstract: In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep…

    Submitted 29 May, 2021; originally announced May 2021.

    Comments: A version of this paper will appear in Acta Numerica

  11. arXiv:2104.13628  [pdf, other]

    cs.LG math.ST stat.ML

    Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures

    Authors: Yuan Cao, Quanquan Gu, Mikhail Belkin

    Abstract: Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mi…

    Submitted 2 January, 2022; v1 submitted 28 April, 2021; originally announced April 2021.

    Comments: 27 pages, 3 figures. In NeurIPS 2021

  12. arXiv:2008.01036  [pdf, other]

    cs.LG math.ST stat.ML

    Multiple Descent: Design Your Own Generalization Curve

    Authors: Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi

    Abstract: This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, locations of those peaks can be explicitly controlled. Our results highlight the fact that both classical U-shaped generalization curve and the recen…

    Submitted 8 November, 2021; v1 submitted 3 August, 2020; originally announced August 2020.

    Comments: Accepted to NeurIPS 2021

  13. arXiv:2003.00307  [pdf, other]

    cs.LG math.OC stat.ML

    Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

    Authors: Chaoyue Liu, Libin Zhu, Mikhail Belkin

    Abstract: The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that incl…

    Submitted 26 May, 2021; v1 submitted 29 February, 2020; originally announced March 2020.

    Comments: The discussion on transition to linearity in Version 1 has been moved to arXiv:2010.01092 (appeared in NeurIPS 2020)

  14. arXiv:1811.02564  [pdf, ps, other]

    math.OC cs.LG stat.ML

    On exponential convergence of SGD in non-convex over-parametrized learning

    Authors: Raef Bassily, Mikhail Belkin, Siyuan Ma

    Abstract: Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [8] we analyzed how interpolation, common in modern over-parametrized learning, results in exp…

    Submitted 5 November, 2018; originally announced November 2018.

  15. arXiv:1806.09471  [pdf, other]

    stat.ML cs.LG math.ST

    Does data interpolation contradict statistical optimality?

    Authors: Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov

    Abstract: We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.

    Submitted 25 June, 2018; originally announced June 2018.

  16. arXiv:1802.10235  [pdf, other]

    cs.LG math.OC

    Parametrized Accelerated Methods Free of Condition Number

    Authors: Chaoyue Liu, Mikhail Belkin

    Abstract: Analyses of accelerated (momentum-based) gradient descent usually assume bounded condition number to obtain exponential convergence rates. However, in many real problems, e.g., kernel methods or deep neural networks, the condition number, even locally, can be unbounded, unknown or mis-estimated. This poses problems in both implementing and analyzing accelerated algorithms. In this paper, we addres…

    Submitted 27 February, 2018; originally announced February 2018.

    Comments: 23 pages, 3 figures

  17. arXiv:1607.01718  [pdf, other]

    stat.ML cs.DS math.ST

    Graphons, mergeons, and so on!

    Authors: Justin Eldridge, Mikhail Belkin, Yusu Wang

    Abstract: In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We defi…

    Submitted 22 May, 2017; v1 submitted 6 July, 2016; originally announced July 2016.

  18. arXiv:1506.06422  [pdf, other]

    stat.ML math.ST

    Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering

    Authors: Justin Eldridge, Mikhail Belkin, Yusu Wang

    Abstract: Hierarchical clustering is a popular method for analyzing data which associates a tree to a dataset. Hartigan consistency has been used extensively as a framework to analyze such clustering algorithms from a statistical point of view. Still, as we show in the paper, a tree which is Hartigan consistent with a given density can look very different than the correct limit tree. Specifically, Hartigan…

    Submitted 13 July, 2015; v1 submitted 21 June, 2015; originally announced June 2015.

  19. arXiv:1105.3931  [pdf, ps, other]

    cs.LG math.NA stat.ML

    Behavior of Graph Laplacians on Manifolds with Boundary

    Authors: Xueyuan Zhou, Mikhail Belkin

    Abstract: In manifold learning, algorithms based on graph Laplacians constructed from data have received considerable attention both in practical applications and theoretical analysis. In particular, the convergence of graph Laplacians obtained from sampled data to certain continuous operators has become an active research topic recently. Most of the existing work has been done under the assumption that the…

    Submitted 19 May, 2011; originally announced May 2011.

  20. Consistency of spectral clustering

    Authors: Ulrike von Luxburg, Mikhail Belkin, Olivier Bousquet

    Abstract: Consistency is a key property of all statistical procedures analyzing randomly sampled data. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new metho…

    Submitted 4 April, 2008; originally announced April 2008.

    Comments: Published at http://dx.doi.org/10.1214/009053607000000640 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOS-AOS0287

    MSC Class: 62G20 (Primary) 05C50 (Secondary)

    Journal ref: Annals of Statistics 2008, Vol. 36, No. 2, 555-586