-
Dynamical mean-field analysis of adaptive Langevin diffusions: Replica-symmetric fixed point and empirical Bayes
Authors:
Zhou Fan,
Justin Ko,
Bruno Loureiro,
Yue M. Lu,
Yandi Shen
Abstract:
In many applications of statistical estimation via sampling, one may wish to sample from a high-dimensional target distribution that is adaptively evolving to the samples already seen. We study an example of such dynamics, given by a Langevin diffusion for posterior sampling in a Bayesian linear regression model with i.i.d. regression design, whose prior continuously adapts to the Langevin traject…
▽ More
In many applications of statistical estimation via sampling, one may wish to sample from a high-dimensional target distribution that is adaptively evolving to the samples already seen. We study an example of such dynamics, given by a Langevin diffusion for posterior sampling in a Bayesian linear regression model with i.i.d. regression design, whose prior continuously adapts to the Langevin trajectory via a maximum marginal-likelihood scheme. Results of dynamical mean-field theory (DMFT) developed in our companion paper establish a precise high-dimensional asymptotic limit for the joint evolution of the prior parameter and law of the Langevin sample. In this work, we carry out an analysis of the equations that describe this DMFT limit, under conditions of approximate time-translation-invariance which include, in particular, settings where the posterior law satisfies a log-Sobolev inequality. In such settings, we show that this adaptive Langevin trajectory converges on a dimension-independent time horizon to an equilibrium state that is characterized by a system of scalar fixed-point equations, and the associated prior parameter converges to a critical point of a replica-symmetric limit for the model free energy. As a by-product of our analyses, we obtain a new dynamical proof that this replica-symmetric limit for the free energy is exact, in models having a possibly misspecified prior and where a log-Sobolev inequality holds for the posterior law.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Dynamical mean-field analysis of adaptive Langevin diffusions: Propagation-of-chaos and convergence of the linear response
Authors:
Zhou Fan,
Justin Ko,
Bruno Loureiro,
Yue M. Lu,
Yandi Shen
Abstract:
Motivated by an application to empirical Bayes learning in high-dimensional regression, we study a class of Langevin diffusions in a system with random disorder, where the drift coefficient is driven by a parameter that continuously adapts to the empirical distribution of the realized process up to the current time. The resulting dynamics take the form of a stochastic interacting particle system h…
▽ More
Motivated by an application to empirical Bayes learning in high-dimensional regression, we study a class of Langevin diffusions in a system with random disorder, where the drift coefficient is driven by a parameter that continuously adapts to the empirical distribution of the realized process up to the current time. The resulting dynamics take the form of a stochastic interacting particle system having both a McKean-Vlasov type interaction and a pairwise interaction defined by the random disorder. We prove a propagation-of-chaos result, showing that in the large system limit over dimension-independent time horizons, the empirical distribution of sample paths of the Langevin process converges to a deterministic limit law that is described by dynamical mean-field theory. This law is characterized by a system of dynamical fixed-point equations for the limit of the drift parameter and for the correlation and response kernels of the limiting dynamics. Using a dynamical cavity argument, we verify that these correlation and response kernels arise as the asymptotic limits of the averaged correlation and linear response functions of single coordinates of the system. These results enable an asymptotic analysis of an empirical Bayes Langevin dynamics procedure for learning an unknown prior parameter in a linear regression model, which we develop in a companion paper.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities
Authors:
Yatin Dandi,
Luca Pesce,
Hugo Cui,
Florent Krzakala,
Yue M. Lu,
Bruno Loureiro
Abstract:
A key property of neural networks is their capacity of adapting to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remain limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigoro…
▽ More
A key property of neural networks is their capacity of adapting to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remain limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of certain low-dimensional operators. This allows us to sharply characterize the impact of training in the asymptotic feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum modify with training. The deterministic equivalent further yields the exact asymptotic generalization error, shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest. Different from previous work, our result holds in the challenging maximal learning rate regime, is fully rigorous and allows for finitely supported second layer initialization, which turns out to be crucial for studying the functional expressivity of the learned features. This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
△ Less
Submitted 24 October, 2024;
originally announced October 2024.
-
Asymptotics of Random Feature Regression Beyond the Linear Scaling Regime
Authors:
Hong Hu,
Yue M. Lu,
Theodor Misiakiewicz
Abstract:
Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data. It was shown, e.g., through the double descent phenomenon, that the number of parameters is a poor proxy for the model complexity and generalization capabilities. This leaves open the question of understanding the impact of parametrization on the performanc…
▽ More
Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data. It was shown, e.g., through the double descent phenomenon, that the number of parameters is a poor proxy for the model complexity and generalization capabilities. This leaves open the question of understanding the impact of parametrization on the performance of these models. How does model complexity and generalization depend on the number of parameters $p$? How should we choose $p$ relative to the sample size $n$ to achieve optimal test error?
In this paper, we investigate the example of random feature ridge regression (RFRR). This model can be seen either as a finite-rank approximation to kernel ridge regression (KRR), or as a simplified model for neural networks trained in the so-called lazy regime. We consider covariates uniformly distributed on the $d$-dimensional sphere and compute sharp asymptotics for the RFRR test error in the high-dimensional polynomial scaling, where $p,n,d \to \infty$ while $p/ d^{κ_1}$ and $n / d^{κ_2}$ stay constant, for all $κ_1 , κ_2 \in \mathbb{R}_{>0}$. These asymptotics precisely characterize the impact of the number of random features and regularization parameter on the test performance. In particular, RFRR exhibits an intuitive trade-off between approximation and generalization power. For $n = o(p)$, the sample size $n$ is the bottleneck and RFRR achieves the same performance as KRR (which is equivalent to taking $p = \infty$). On the other hand, if $p = o(n)$, the number of random features $p$ is the limiting factor and RFRR test error matches the approximation error of the random feature model class (akin to taking $n = \infty$). Finally, a double descent appears at $n= p$, a phenomenon that was previously only characterized in the linear scaling $κ_1 = κ_2 = 1$.
△ Less
Submitted 12 March, 2024;
originally announced March 2024.
-
Universality for the global spectrum of random inner-product kernel matrices in the polynomial regime
Authors:
Sofiia Dubova,
Yue M. Lu,
Benjamin McKenna,
Horng-Tzer Yau
Abstract:
We consider certain large random matrices, called random inner-product kernel matrices, which are essentially given by a nonlinear function $f$ applied entrywise to a sample-covariance matrix, $f(X^TX)$, where $X \in \mathbb{R}^{d \times N}$ is random and normalized in such a way that $f$ typically has order-one arguments. We work in the polynomial regime, where $N \asymp d^\ell$ for some…
▽ More
We consider certain large random matrices, called random inner-product kernel matrices, which are essentially given by a nonlinear function $f$ applied entrywise to a sample-covariance matrix, $f(X^TX)$, where $X \in \mathbb{R}^{d \times N}$ is random and normalized in such a way that $f$ typically has order-one arguments. We work in the polynomial regime, where $N \asymp d^\ell$ for some $\ell > 0$, not just the linear regime where $\ell = 1$. Earlier work by various authors showed that, when the columns of $X$ are either uniform on the sphere or standard Gaussian vectors, and when $\ell$ is an integer (the linear regime $\ell = 1$ is particularly well-studied), the bulk eigenvalues of such matrices behave in a simple way: They are asymptotically given by the free convolution of the semicircular and Marčenko-Pastur distributions, with relative weights given by expanding $f$ in the Hermite basis. In this paper, we show that this phenomenon is universal, holding as soon as $X$ has i.i.d. entries with all finite moments. In the case of non-integer $\ell$, the Marčenko-Pastur term disappears (its weight in the free convolution vanishes), and the spectrum is just semicircular.
△ Less
Submitted 27 October, 2023;
originally announced October 2023.
-
Spectral Universality of Regularized Linear Regression with Nearly Deterministic Sensing Matrices
Authors:
Rishabh Dudeja,
Subhabrata Sen,
Yue M. Lu
Abstract:
It has been observed that the performances of many high-dimensional estimation problems are universal with respect to underlying sensing (or design) matrices. Specifically, matrices with markedly different constructions seem to achieve identical performance if they share the same spectral distribution and have ``generic'' singular vectors. We prove this universality phenomenon for the case of conv…
▽ More
It has been observed that the performances of many high-dimensional estimation problems are universal with respect to underlying sensing (or design) matrices. Specifically, matrices with markedly different constructions seem to achieve identical performance if they share the same spectral distribution and have ``generic'' singular vectors. We prove this universality phenomenon for the case of convex regularized least squares (RLS) estimators under a linear regression model with additive Gaussian noise. Our main contributions are two-fold: (1) We introduce a notion of universality classes for sensing matrices, defined through a set of deterministic conditions that fix the spectrum of the sensing matrix and precisely capture the previously heuristic notion of generic singular vectors; (2) We show that for all sensing matrices that lie in the same universality class, the dynamics of the proximal gradient descent algorithm for solving the regression problem, as well as the performance of RLS estimators themselves (under additional strong convexity conditions) are asymptotically identical. In addition to including i.i.d. Gaussian and rotational invariant matrices as special cases, our universality class also contains highly structured, strongly correlated, or even (nearly) deterministic matrices. Examples of the latter include randomly signed versions of incoherent tight frames and randomly subsampled Hadamard transforms. As a consequence of this universality principle, the asymptotic performance of regularized linear regression on many structured matrices constructed with limited randomness can be characterized by using the rotationally invariant ensemble as an equivalent yet mathematically more tractable surrogate.
△ Less
Submitted 20 July, 2023; v1 submitted 4 August, 2022;
originally announced August 2022.
-
An Equivalence Principle for the Spectrum of Random Inner-Product Kernel Matrices with Polynomial Scalings
Authors:
Yue M. Lu,
Horng-Tzer Yau
Abstract:
We investigate random matrices whose entries are obtained by applying a nonlinear kernel function to pairwise inner products between $n$ independent data vectors, drawn uniformly from the unit sphere in $\mathbb{R}^d$. This study is motivated by applications in machine learning and statistics, where these kernel random matrices and their spectral properties play significant roles. We establish the…
▽ More
We investigate random matrices whose entries are obtained by applying a nonlinear kernel function to pairwise inner products between $n$ independent data vectors, drawn uniformly from the unit sphere in $\mathbb{R}^d$. This study is motivated by applications in machine learning and statistics, where these kernel random matrices and their spectral properties play significant roles. We establish the weak limit of the empirical spectral distribution of these matrices in a polynomial scaling regime, where $d, n \to \infty$ such that $n / d^\ell \to κ$, for some fixed $\ell \in \mathbb{N}$ and $κ\in (0, \infty)$. Our findings generalize an earlier result by Cheng and Singer, who examined the same model in the linear scaling regime (with $\ell = 1$).
Our work reveals an equivalence principle: the spectrum of the random kernel matrix is asymptotically equivalent to that of a simpler matrix model, constructed as a linear combination of a (shifted) Wishart matrix and an independent matrix sampled from the Gaussian orthogonal ensemble. The aspect ratio of the Wishart matrix and the coefficients of the linear combination are determined by $\ell$ and the expansion of the kernel function in the orthogonal Hermite polynomial basis. Consequently, the limiting spectrum of the random kernel matrix can be characterized as the free additive convolution between a Marchenko-Pastur law and a semicircle law. We also extend our results to cases with data vectors sampled from isotropic Gaussian distributions instead of spherical distributions.
△ Less
Submitted 5 May, 2023; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Universality of Approximate Message Passing with Semi-Random Matrices
Authors:
Rishabh Dudeja,
Yue M. Lu,
Subhabrata Sen
Abstract:
Approximate Message Passing (AMP) is a class of iterative algorithms that have found applications in many problems in high-dimensional statistics and machine learning. In its general form, AMP can be formulated as an iterative procedure driven by a matrix $\mathbf{M}$. Theoretical analyses of AMP typically assume strong distributional properties on $\mathbf{M}$ such as $\mathbf{M}$ has i.i.d. sub-…
▽ More
Approximate Message Passing (AMP) is a class of iterative algorithms that have found applications in many problems in high-dimensional statistics and machine learning. In its general form, AMP can be formulated as an iterative procedure driven by a matrix $\mathbf{M}$. Theoretical analyses of AMP typically assume strong distributional properties on $\mathbf{M}$ such as $\mathbf{M}$ has i.i.d. sub-Gaussian entries or is drawn from a rotational invariant ensemble. However, numerical experiments suggest that the behavior of AMP is universal, as long as the eigenvectors of $\mathbf{M}$ are generic. In this paper, we take the first step in rigorously understanding this universality phenomenon. In particular, we investigate a class of memory-free AMP algorithms (proposed by Çakmak and Opper for mean-field Ising spin glasses), and show that their asymptotic dynamics is universal on a broad class of semi-random matrices. In addition to having the standard rotational invariant ensemble as a special case, the class of semi-random matrices that we define in this work also includes matrices constructed with very limited randomness. One such example is a randomly signed version of the Sine model, introduced by Marinari, Parisi, Potters, and Ritort for spin glasses with fully deterministic couplings.
△ Less
Submitted 1 May, 2023; v1 submitted 8 April, 2022;
originally announced April 2022.
-
Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization
Authors:
Benjamin Aubin,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová
Abstract:
We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer neural network with random iid inputs. We study the generalization performances of standard classifiers in the high-dimensional regime where $α=n/d$ is kept finite in the limit of a high dimension $d$ and number of samples $n$. Our contribution is three-fold: First, we…
▽ More
We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer neural network with random iid inputs. We study the generalization performances of standard classifiers in the high-dimensional regime where $α=n/d$ is kept finite in the limit of a high dimension $d$ and number of samples $n$. Our contribution is three-fold: First, we prove a formula for the generalization error achieved by $\ell_2$ regularized classifiers that minimize a convex loss. This formula was first obtained by the heuristic replica method of statistical physics. Secondly, focussing on commonly used loss functions and optimizing the $\ell_2$ regularization strength, we observe that while ridge regression performance is poor, logistic and hinge regression are surprisingly able to approach the Bayes-optimal generalization error extremely closely. As $α\to \infty$ they lead to Bayes-optimal rates, a fact that does not follow from predictions of margin-based generalization error bounds. Third, we design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error.
△ Less
Submitted 7 November, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
The role of regularization in classification of high-dimensional noisy Gaussian mixture
Authors:
Francesca Mignacco,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová
Abstract:
We consider a high-dimensional mixture of two Gaussians in the noisy regime where even an oracle knowing the centers of the clusters misclassifies a small but finite fraction of the points. We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number $n$ of samples and th…
▽ More
We consider a high-dimensional mixture of two Gaussians in the noisy regime where even an oracle knowing the centers of the clusters misclassifies a small but finite fraction of the points. We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number $n$ of samples and their dimension $d$ go to infinity while their ratio is fixed to $α= n/d$. We discuss surprising effects of the regularization that in some cases allows to reach the Bayes-optimal performances. We also illustrate the interpolation peak at low regularization, and analyze the role of the respective sizes of the two clusters.
△ Less
Submitted 26 February, 2020;
originally announced February 2020.
-
SLOPE for Sparse Linear Regression:Asymptotics and Optimal Regularization
Authors:
Hong Hu,
Yue M. Lu
Abstract:
In sparse linear regression, the SLOPE estimator generalizes LASSO by penalizing different coordinates of the estimate according to their magnitudes. In this paper, we present a precise performance characterization of SLOPE in the asymptotic regime where the number of unknown parameters grows in proportion to the number of observations. Our asymptotic characterization enables us to derive the fund…
▽ More
In sparse linear regression, the SLOPE estimator generalizes LASSO by penalizing different coordinates of the estimate according to their magnitudes. In this paper, we present a precise performance characterization of SLOPE in the asymptotic regime where the number of unknown parameters grows in proportion to the number of observations. Our asymptotic characterization enables us to derive the fundamental limits of SLOPE in both estimation and variable selection settings. We also provide a computational feasible way to optimally design the regularizing sequences such that the fundamental limits are reached. In both settings, we show that the optimal design problem can be formulated as certain infinite-dimensional convex optimization problems, which have efficient and accurate finite-dimensional approximations. Numerical simulations verify all our asymptotic predictions. They demonstrate the superiority of our optimal regularizing sequences over other designs used in the existing literature.
△ Less
Submitted 4 June, 2021; v1 submitted 27 March, 2019;
originally announced March 2019.
-
Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview
Authors:
Yuejie Chi,
Yue M. Lu,
Yuxin Chen
Abstract:
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. Th…
▽ More
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently.
In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated consideration of optimization and statistics leads to fruitful research findings.
△ Less
Submitted 19 September, 2019; v1 submitted 25 September, 2018;
originally announced September 2018.
-
Scaling Limit: Exact and Tractable Analysis of Online Learning Algorithms with Applications to Regularized Regression and PCA
Authors:
Chuang Wang,
Jonathan Mattingly,
Yue M. Lu
Abstract:
We present a framework for analyzing the exact dynamics of a class of online learning algorithms in the high-dimensional scaling limit. Our results are applied to two concrete examples: online regularized linear regression and principal component analysis. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measures of the target…
▽ More
We present a framework for analyzing the exact dynamics of a class of online learning algorithms in the high-dimensional scaling limit. Our results are applied to two concrete examples: online regularized linear regression and principal component analysis. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measures of the target feature vector and its estimates provided by the algorithms will converge weakly to a deterministic measured-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE can be efficiently obtained. These solutions lead to precise predictions of the performance of the algorithms, as many practical performance metrics are linear functionals of the joint empirical measures. In addition to characterizing the dynamic performance of online learning algorithms, our asymptotic analysis also provides useful insights. In particular, in the high-dimensional limit, and due to exchangeability, the original coupled dynamics associated with the algorithms will be asymptotically "decoupled", with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent. Exploiting this insight for nonconvex optimization problems may prove an interesting line of future research.
△ Less
Submitted 7 December, 2017;
originally announced December 2017.
-
Randomized Kaczmarz Algorithm for Inconsistent Linear Systems: An Exact MSE Analysis
Authors:
Chuang Wang,
Ameya Agaskar,
Yue M. Lu
Abstract:
We provide a complete characterization of the randomized Kaczmarz algorithm (RKA) for inconsistent linear systems. The Kaczmarz algorithm, known in some fields as the algebraic reconstruction technique, is a classical method for solving large-scale overdetermined linear systems through a sequence of projection operators; the randomized Kaczmarz algorithm is a recent proposal by Strohmer and Vershy…
▽ More
We provide a complete characterization of the randomized Kaczmarz algorithm (RKA) for inconsistent linear systems. The Kaczmarz algorithm, known in some fields as the algebraic reconstruction technique, is a classical method for solving large-scale overdetermined linear systems through a sequence of projection operators; the randomized Kaczmarz algorithm is a recent proposal by Strohmer and Vershynin to randomize the sequence of projections in order to guarantee exponential convergence (in mean square) to the solutions. A flurry of work followed this development, with renewed interest in the algorithm, its extensions, and various bounds on their performance. Earlier, we studied the special case of consistent linear systems and provided an exact formula for the mean squared error (MSE) in the value reconstructed by RKA, as well as a simple way to compute the exact decay rate of the error. In this work, we consider the case of inconsistent linear systems, which is a more relevant scenario for most applications. First, by using a "lifting trick", we derive an exact formula for the MSE given a fixed noise vector added to the measurements. Then we show how to average over the noise when it is drawn from a distribution with known first and second-order statistics. Finally, we demonstrate the accuracy of our exact MSE formulas through numerical simulations, which also illustrate that previous upper bounds in the literature may be several orders of magnitude too high.
△ Less
Submitted 31 January, 2015;
originally announced February 2015.