-
A precise asymptotic analysis of learning diffusion models: theory and insights
Authors:
Hugo Cui,
Cengiz Pehlevan,
Yue M. Lu
Abstract:
In this manuscript, we consider the problem of learning a flow or diffusion-based generative model parametrized by a two-layer auto-encoder, trained with online stochastic gradient descent, on a high-dimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise, and lead to model collapse when the generative model is re-trained on generated synthetic data.
Submitted 7 January, 2025;
originally announced January 2025.
-
Asymptotic theory of in-context learning by linear attention
Authors:
Yue M. Lu,
Mary I. Letey,
Jacob A. Zavatone-Veth,
Anindita Maiti,
Cengiz Pehlevan
Abstract:
Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically-rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: In the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.
Submitted 4 February, 2025; v1 submitted 19 May, 2024;
originally announced May 2024.
-
Asymptotics of feature learning in two-layer networks after one gradient-step
Authors:
Hugo Cui,
Luca Pesce,
Yatin Dandi,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová,
Bruno Loureiro
Abstract:
In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al., 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit where the number of samples, the width, and the input dimension grow at a proportional rate. The resulting characterization for sRFs also captures closely the learning curves of the original network model. This enables us to understand how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient -- where at initialization it can only express linear functions in this regime.
Submitted 4 June, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Ring-Exchange Interaction Effects on Magnons in Dirac Magnet CoTiO$_3$
Authors:
Yufei Li,
Thuc T. Mai,
M. Karaki,
E. V. Jasper,
K. F. Garrity,
C. Lyon,
D. Shaw,
T. DeLazzer,
A. J. Biacchi,
R. L. Dally,
D. M. Heligman,
J. Gdanski,
T. Adel,
M. F. Muñoz,
A. Giovannone,
A. Pawbake,
C. Faugeras,
J. R. Simpson,
K. Ross,
N. Trivedi,
Y. M. Lu,
A. R. Hight Walker,
R. Valdés Aguilar
Abstract:
The magnetic interactions that determine magnetic order and magnon energies typically involve only two spins. While rare, multi-spin interactions can also appear in quantum magnets and be the driving force in the ground state selection and in the nature of its excitations. By performing time-domain terahertz and magneto-Raman spectroscopy measurements combined with theoretical modeling, we determine the origin of the magnon excitation gap in Dirac antiferromagnet CoTiO$_3$. By adding a ring-exchange interaction in a hexagonal plaquette of the honeycomb lattice to both an XXZ spin model and to a low energy spin-orbital flavor wave model, a gap is generated in the magnon spectrum at the Brillouin zone center. With this addition, the flavor wave model reproduces a large swath of experimental results including terahertz, Raman, inelastic neutron scattering, and magnetization experiments.
Submitted 4 June, 2024; v1 submitted 10 December, 2022;
originally announced December 2022.
-
Construction of optimal spectral methods in phase retrieval
Authors:
Antoine Maillard,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová
Abstract:
We consider the phase retrieval problem, in which the observer wishes to recover an $n$-dimensional real or complex signal $\mathbf{X}^\star$ from the (possibly noisy) observation of $|\mathbf{\Phi} \mathbf{X}^\star|$, in which $\mathbf{\Phi}$ is a matrix of size $m \times n$. We consider a \emph{high-dimensional} setting where $n,m \to \infty$ with $m/n = \mathcal{O}(1)$, and a large class of (possibly correlated) random matrices $\mathbf{\Phi}$ and observation channels. Spectral methods are a powerful tool to obtain approximate observations of the signal $\mathbf{X}^\star$ which can then be used as initialization for a subsequent algorithm, at a low computational cost. In this paper, we extend and unify previous results and approaches on spectral methods for the phase retrieval problem. More precisely, we combine the linearization of message-passing algorithms and the analysis of the \emph{Bethe Hessian}, a classical tool of statistical physics. Using this toolbox, we show how to derive optimal spectral methods for arbitrary channel noise and right-unitarily invariant matrix $\mathbf{\Phi}$, in an automated manner (i.e. with no optimization over any hyperparameter or preprocessing function).
Submitted 14 October, 2021; v1 submitted 8 December, 2020;
originally announced December 2020.
-
Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization
Authors:
Benjamin Aubin,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová
Abstract:
We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer neural network with random iid inputs. We study the generalization performance of standard classifiers in the high-dimensional regime where $\alpha=n/d$ is kept finite in the limit of a high dimension $d$ and number of samples $n$. Our contribution is three-fold: First, we prove a formula for the generalization error achieved by $\ell_2$ regularized classifiers that minimize a convex loss. This formula was first obtained by the heuristic replica method of statistical physics. Second, focusing on commonly used loss functions and optimizing the $\ell_2$ regularization strength, we observe that while ridge regression performance is poor, logistic and hinge regression are surprisingly able to approach the Bayes-optimal generalization error extremely closely. As $\alpha\to \infty$ they lead to Bayes-optimal rates, a fact that does not follow from predictions of margin-based generalization error bounds. Third, we design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error.
Submitted 7 November, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
The role of regularization in classification of high-dimensional noisy Gaussian mixture
Authors:
Francesca Mignacco,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová
Abstract:
We consider a high-dimensional mixture of two Gaussians in the noisy regime where even an oracle knowing the centers of the clusters misclassifies a small but finite fraction of the points. We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number $n$ of samples and their dimension $d$ go to infinity while their ratio is fixed to $\alpha = n/d$. We discuss surprising effects of the regularization, which in some cases allow one to reach Bayes-optimal performance. We also illustrate the interpolation peak at low regularization, and analyze the role of the respective sizes of the two clusters.
Submitted 26 February, 2020;
originally announced February 2020.
-
Generalized Approximate Survey Propagation for High-Dimensional Estimation
Authors:
Luca Saglietti,
Yue M. Lu,
Carlo Lucibello
Abstract:
In Generalized Linear Estimation (GLE) problems, we seek to estimate a signal that is observed through a linear transform followed by a component-wise, possibly nonlinear and noisy, channel. In the Bayesian optimal setting, Generalized Approximate Message Passing (GAMP) is known to achieve optimal performance for GLE. However, its performance can significantly degrade whenever there is a mismatch between the assumed and the true generative model, a situation frequently encountered in practice. In this paper, we propose a new algorithm, named Generalized Approximate Survey Propagation (GASP), for solving GLE in the presence of prior or model mis-specifications. As a prototypical example, we consider the phase retrieval problem, where we show that GASP outperforms the corresponding GAMP, reducing the reconstruction threshold and, for certain choices of its parameters, approaching Bayesian optimal performance. Furthermore, we present a set of State Evolution equations that exactly characterize the dynamics of GASP in the high-dimensional limit.
Submitted 13 May, 2019;
originally announced May 2019.
-
A Solvable High-Dimensional Model of GAN
Authors:
Chuang Wang,
Hong Hu,
Yue M. Lu
Abstract:
We present a theoretical analysis of the training process for a single-layer GAN fed by high-dimensional input data. The training dynamics of the proposed model at both microscopic and macroscopic scales can be exactly analyzed in the high-dimensional limit. In particular, we prove that the macroscopic quantities measuring the quality of the training process converge to a deterministic process characterized by an ordinary differential equation (ODE), whereas the microscopic states containing all the detailed weights remain stochastic, whose dynamics can be described by a stochastic differential equation (SDE). This analysis provides a new perspective different from recent analyses in the limit of small learning rate, where the microscopic state is always considered deterministic, and the contribution of noise is ignored. From our analysis, we show that the level of the background noise is essential to the convergence of the training process: setting the noise level too strong leads to failure of feature recovery, whereas setting the noise too weak causes oscillation. Although this work focuses on a simple copy model of GAN, we believe the analysis methods and insights developed here would prove useful in the theoretical understanding of other variants of GANs with more advanced training algorithms.
Submitted 28 October, 2019; v1 submitted 21 May, 2018;
originally announced May 2018.
-
Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis
Authors:
Chuang Wang,
Yonina C. Eldar,
Yue M. Lu
Abstract:
We present a high-dimensional analysis of three popular algorithms, namely, Oja's method, GROUSE and PETRELS, for subspace estimation from streaming and highly incomplete observations. We show that, with proper time scaling, the time-varying principal angles between the true subspace and its estimates given by the algorithms converge weakly to deterministic processes when the ambient dimension $n$ tends to infinity. Moreover, the limiting processes can be exactly characterized as the unique solutions of certain ordinary differential equations (ODEs). A finite sample bound is also given, showing that the rate of convergence towards such limits is $\mathcal{O}(1/\sqrt{n})$. In addition to providing asymptotically exact predictions of the dynamic performance of the algorithms, our high-dimensional analysis yields several insights, including an asymptotic equivalence between Oja's method and GROUSE, and a precise scaling relationship linking the amount of missing data to the signal-to-noise ratio. By analyzing the solutions of the limiting ODEs, we also establish phase transition phenomena associated with the steady-state performance of these techniques.
Submitted 17 October, 2018; v1 submitted 17 May, 2018;
originally announced May 2018.
-
The Scaling Limit of High-Dimensional Online Independent Component Analysis
Authors:
Chuang Wang,
Yue M. Lu
Abstract:
We analyze the dynamics of an online algorithm for independent component analysis in the high-dimensional scaling limit. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measure of the target feature vector and the estimates provided by the algorithm will converge weakly to a deterministic measure-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE, which involves two spatial variables and one time variable, can be efficiently obtained. These solutions provide detailed information about the performance of the ICA algorithm, as many practical performance metrics are functionals of the joint empirical measures. Numerical simulations show that our asymptotic analysis is accurate even for moderate dimensions. In addition to providing a tool for understanding the performance of the algorithm, our PDE analysis also provides useful insight. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically "decoupled", with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent. Exploiting this insight to design new algorithms for achieving optimal trade-offs between computational and statistical efficiency may prove an interesting line of future research.
Submitted 6 November, 2017; v1 submitted 15 October, 2017;
originally announced October 2017.
-
Online Learning for Sparse PCA in High Dimensions: Exact Dynamics and Phase Transitions
Authors:
Chuang Wang,
Yue M. Lu
Abstract:
We study the dynamics of an online algorithm for learning a sparse leading eigenvector from samples generated from a spiked covariance model. This algorithm combines the classical Oja's method for online PCA with an element-wise nonlinearity at each iteration to promote sparsity. In the high-dimensional limit, the joint empirical measure of the underlying sparse eigenvector and its estimate provided by the algorithm is shown to converge weakly to a deterministic, measure-valued process. This scaling limit is characterized as the unique solution of a nonlinear PDE, and it provides exact information regarding the asymptotic performance of the algorithm. For example, performance metrics such as the cosine similarity and the misclassification rate in sparse support recovery can be obtained by examining the limiting dynamics. A steady-state analysis of the nonlinear PDE also reveals an interesting phase transition phenomenon. Although our analysis is asymptotic in nature, numerical simulations show that the theoretical predictions are accurate for moderate signal dimensions.
Submitted 7 September, 2016;
originally announced September 2016.
-
A new collective mode in YBCO observed by time-domain reflectometry
Authors:
J. P. Hinton,
J. D. Koralek,
Y. M. Lu,
A. Vishwanath,
J. Orenstein,
D. A. Bonn,
W. N. Hardy,
Ruixing Liang
Abstract:
We report the observation of coherent oscillations associated with charge density wave (CDW) order in the underdoped cuprate superconductor YBa$_2$Cu$_3$O$_{6+x}$ by time-resolved optical reflectivity. Oscillations with frequency 1.87 THz onset at approximately 105 K and 130 K for dopings of x = 0.67 (ortho-VIII) and x = 0.75 (ortho-III), respectively. Upon cooling below the superconducting critical temperature (T_c), the oscillation amplitude is enhanced, the phase shifts by π, and the frequency softens by δν/ν ≈ 7%. A bi-quadratically coupled Landau-Ginzburg model qualitatively describes this behavior as arising from competition between superconducting and CDW orders.
Submitted 6 May, 2013;
originally announced May 2013.
-
Phototransistor Behavior Based on Dye-Sensitized Solar Cell
Authors:
X. Q. Wang,
C. B. Cai,
Y. F. Wang,
W. Q. Zhou,
Y. M. Lu,
Z. Y. Liu
Abstract:
In the present work, a light-controlled device cell is established based on the dye-sensitized solar cell using nanocrystalline TiO$_2$ films. Voltage-current curves are characterized by three types of transport behavior: linear increase, saturated plateau and breakdown-like increase, which are typical of a photo-gated transistor. Moreover, an asymmetric behavior is observed in the voltage-current loops, which is believed to arise from the difference in the effective photo-conducting areas. The photovoltaic voltage between the shared counter electrode and drain (VCE-D) is investigated as well, clarifying that the predominant dark process in the source and the predominant photovoltaic process in the drain are connected in series, modifying the electric potential levels and thus resulting in the characteristic phototransistor behaviors.
Submitted 24 October, 2012;
originally announced October 2012.
-
Unconventional Scaling of the Anomalous Hall Effect Accompanying Electron Localization Correction in the Dirty Regime
Authors:
Y. M. Lu,
J. W. Cai,
Zaibing Guo,
X. X. Zhang
Abstract:
Scaling of the anomalous Hall conductivity to the longitudinal conductivity has been observed in the dirty regime of two-dimensional weak and strong localization regions in ultrathin, polycrystalline, chemically disordered, ferromagnetic FePt films. The relationship between electron transport and temperature reveals a quantitatively insignificant Coulomb interaction in these films, while the temperature-dependent anomalous Hall conductivity experiences a quantum correction from electron localization. At the onset of this correction, the low-temperature anomalous Hall resistivity begins to saturate when the thickness of the FePt film is reduced, and the corresponding Hall conductivity scaling exponent becomes 2, which is above the value of 1.6 given by the recent unified theory ($\sigma_{AH} \propto \sigma_{xx}^{1.6}$). Our results strongly suggest that the correction of the electron localization modulates the scaling exponent of the anomalous Hall effect.
Submitted 10 October, 2012;
originally announced October 2012.
-
Dynamical Interplay between Coexisting Orders in the Electron-Doped Cuprate Superconductor Nd_{2-x}Ce_xCuO_4
Authors:
J. P. Hinton,
J. D. Koralek,
G. Yu,
E. M. Motoyama,
Y. M. Lu,
A. Vishwanath,
M. Greven,
J. Orenstein
Abstract:
We use coherent pump-probe spectroscopy to measure the photoinduced reflectivity, ΔR, and complex dielectric function, δε, of the electron-doped cuprate superconductor Nd_{2-x}Ce_xCuO_{4+δ} at a value of x near optimal doping, as a function of time, temperature, and laser fluence. We observe the onset of a negative ΔR at T=85 K, above the superconducting transition temperature, T_c, of 23 K, that exhibits a form of scaling consistent with critical fluctuations in the time domain. A positive ΔR onsets at T_c that we associate with superconducting order. We find that the two signals are strongly coupled below T_c, in a manner that suggests a repulsive interaction between superconductivity and antiferromagnetic correlations.
Submitted 4 August, 2012;
originally announced August 2012.