-
Dynamically Learning to Integrate in Recurrent Neural Networks
Authors:
Blake Bordelon,
Jordan Cotler,
Cengiz Pehlevan,
Jacob A. Zavatone-Veth
Abstract:
Learning to remember over long timescales is fundamentally challenging for recurrent neural networks (RNNs). While much prior work has explored why RNNs struggle to learn long timescales and how to mitigate this, we still lack a clear understanding of the dynamics involved when RNNs learn long timescales via gradient descent. Here we build a mathematical theory of the learning dynamics of linear RNNs trained to integrate white noise. We show that when the initial recurrent weights are small, the dynamics of learning are described by a low-dimensional system that tracks a single outlier eigenvalue of the recurrent weights. This reveals the precise manner in which the long timescale associated with white noise integration is learned. We extend our analyses to RNNs learning a damped oscillatory filter, and find rich dynamical equations for the evolution of a conjugate pair of outlier eigenvalues. Taken together, our analyses build a rich mathematical framework for studying dynamical learning problems salient for both machine learning and neuroscience.
Submitted 24 March, 2025;
originally announced March 2025.
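As a rough illustration of the integration task described in this abstract, the following minimal sketch runs a one-dimensional linear recurrence. It is a toy stand-in, not the authors' model or training procedure: the function name `run_linear_rnn` and the scalar parameter `lam` (playing the role of the single outlier eigenvalue) are assumptions made here for illustration.

```python
def run_linear_rnn(lam, inputs):
    """Run the scalar linear recurrence h_t = lam * h_{t-1} + x_t from h = 0.

    `lam` is a hypothetical stand-in for the outlier eigenvalue of the
    recurrent weights; lam = 1 gives a perfect integrator (a running sum),
    while |lam| < 1 gives a leaky integrator that forgets old inputs.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = lam * h + x
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    xs = [1.0, -2.0, 0.5]
    print(run_linear_rnn(1.0, xs))   # running sum of the inputs
    print(run_linear_rnn(0.5, xs))   # leaky version with a short memory
```

In this toy picture, learning a long timescale corresponds to moving `lam` from near zero toward one.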
-
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Authors:
Alexander Atanasov,
Blake Bordelon,
Jacob A. Zavatone-Veth,
Courtney Paquette,
Cengiz Pehlevan
Abstract:
We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and random feature models. Our results include previously known asymptotics as well as novel ones.
Submitted 29 April, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
Risk and cross validation in ridge regression with correlated samples
Authors:
Alexander Atanasov,
Jacob A. Zavatone-Veth,
Cengiz Pehlevan
Abstract:
Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently-computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high dimensional data.
Submitted 31 May, 2025; v1 submitted 8 August, 2024;
originally announced August 2024.
-
Nadaraya-Watson kernel smoothing as a random energy model
Authors:
Jacob A. Zavatone-Veth,
Cengiz Pehlevan
Abstract:
Precise asymptotics have revealed many surprises in high-dimensional regression. These advances, however, have not extended to perhaps the simplest estimator: direct Nadaraya-Watson (NW) kernel smoothing. Here, we describe how one can use ideas from the analysis of the random energy model (REM) in statistical physics to compute sharp asymptotics for the NW estimator when the sample size is exponential in the dimension. As a simple starting point for investigation, we focus on the case in which one aims to estimate a single-index target function using a radial basis function kernel on the sphere. Our main result is a pointwise asymptotic for the NW predictor, showing that it re-scales the argument of the true link function. Our work provides a first step towards a detailed understanding of kernel smoothing in high dimensions.
Submitted 21 November, 2024; v1 submitted 7 August, 2024;
originally announced August 2024.
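For readers unfamiliar with the estimator named in this abstract, here is a minimal sketch of Nadaraya-Watson kernel smoothing. It assumes unit-norm inputs, for which a radial basis function kernel reduces, up to a constant that cancels in the ratio, to exp(beta * <x, x_i>). The function name `nw_predict` and the bandwidth parameter `beta` are illustrative choices, not notation taken from the paper.

```python
import math


def nw_predict(x, xs, ys, beta):
    """Nadaraya-Watson estimate at x: a kernel-weighted average of the ys.

    Weights are proportional to exp(beta * <x, x_i>). For unit vectors this
    matches an RBF kernel, since ||x - x_i||^2 = 2 - 2 <x, x_i>.
    `beta` is a hypothetical inverse-bandwidth parameter.
    """
    scores = [beta * sum(a * b for a, b in zip(x, xi)) for xi in xs]
    shift = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


if __name__ == "__main__":
    xs = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
    ys = [1.0, 2.0, 3.0]
    print(nw_predict((1.0, 0.0), xs, ys, 0.0))   # beta = 0: plain average
    print(nw_predict((1.0, 0.0), xs, ys, 50.0))  # large beta: nearest point dominates
```

Because the weights form a softmax over scaled overlaps, the normalizing sum has the structure of a partition function, which is loosely why tools from the random energy model become relevant.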
-
Asymptotic theory of in-context learning by linear attention
Authors:
Yue M. Lu,
Mary I. Letey,
Jacob A. Zavatone-Veth,
Anindita Maiti,
Cengiz Pehlevan
Abstract:
Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically-rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: In the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.
Submitted 4 February, 2025; v1 submitted 19 May, 2024;
originally announced May 2024.
-
Scaling and renormalization in high-dimensional regression
Authors:
Alexander Atanasov,
Jacob A. Zavatone-Veth,
Cengiz Pehlevan
Abstract:
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models using the basic tools of random matrix theory and free probability. We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning. Analytic formulas for the training and generalization errors are obtained in a few lines of algebra directly from the properties of the $S$-transform of free probability. This allows for a straightforward identification of the sources of power-law scaling in model performance. We compute the generalization error of a broad class of random feature models. We find that in all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. These novel results allow us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.
Submitted 26 June, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Long Sequence Hopfield Memory
Authors:
Hamza Tahir Chaudhry,
Jacob A. Zavatone-Veth,
Dmitry Krotov,
Cengiz Pehlevan
Abstract:
Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between states' transitions and describe a biologically-plausible implementation, with connections to motor neuroscience.
Submitted 2 November, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Learning curves for deep structured Gaussian feature models
Authors:
Jacob A. Zavatone-Veth,
Cengiz Pehlevan
Abstract:
In recent years, significant attention in deep learning theory has been devoted to analyzing when models that interpolate their training data can still generalize well to unseen examples. Many insights have been gained from studying models with multiple layers of Gaussian random features, for which one can compute precise generalization asymptotics. However, few works have considered the effect of weight anisotropy; most assume that the random features are generated using independent and identically distributed Gaussian weights, and allow only for structure in the input data. Here, we use the replica trick from statistical physics to derive learning curves for models with many layers of structured Gaussian features. We show that allowing correlations between the rows of the first layer of features can aid generalization, while structure in later layers is generally detrimental. Our results shed light on how weight structure affects generalization in a simple class of solvable models.
Submitted 23 October, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Neural networks learn to magnify areas near decision boundaries
Authors:
Jacob A. Zavatone-Veth,
Sheng Yang,
Julian A. Rubinfien,
Cengiz Pehlevan
Abstract:
In machine learning, there is a long history of trying to build neural networks that can learn from fewer example data by baking in strong geometric priors. However, it is not always clear a priori what geometric constraints are appropriate for a given task. Here, we consider the possibility that one can uncover useful geometric inductive biases by studying how training molds the Riemannian geometry induced by unconstrained neural network feature maps. We first show that at infinite width, neural networks with random parameters induce highly symmetric metrics on input space. This symmetry is broken by feature learning: networks trained to perform classification tasks learn to magnify local areas along decision boundaries. This holds in deep networks trained on high-dimensional image classification tasks, and even in self-supervised representation learning. These results begin to elucidate how training shapes the geometry induced by unconstrained neural network feature maps, laying the groundwork for an understanding of this richly nonlinear form of feature learning.
Submitted 14 October, 2023; v1 submitted 26 January, 2023;
originally announced January 2023.
-
Replica method for eigenvalues of real Wishart product matrices
Authors:
Jacob A. Zavatone-Veth,
Cengiz Pehlevan
Abstract:
We show how the replica method can be used to compute the asymptotic eigenvalue spectrum of a real Wishart product matrix. For unstructured factors, this provides a compact, elementary derivation of a polynomial condition on the Stieltjes transform first proved by Müller [IEEE Trans. Inf. Theory. 48, 2086-2091 (2002)]. We then show how this computation can be extended to ensembles where the factors are drawn from matrix Gaussian distributions with general correlation structure. For both unstructured and structured ensembles, we derive polynomial conditions on the average values of the minimum and maximum eigenvalues, which in the unstructured case match the results obtained by Akemann, Ipsen, and Kieburg [Phys. Rev. E 88, 052118 (2013)] for the complex Wishart product ensemble.
Submitted 20 January, 2023; v1 submitted 21 September, 2022;
originally announced September 2022.
-
Contrasting random and learned features in deep Bayesian linear regression
Authors:
Jacob A. Zavatone-Veth,
William L. Tong,
Cengiz Pehlevan
Abstract:
Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display sample-wise double-descent behavior in the presence of label noise. Random feature models can also display model-wise double-descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.
Submitted 16 June, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
On neural network kernels and the storage capacity problem
Authors:
Jacob A. Zavatone-Veth,
Cengiz Pehlevan
Abstract:
In this short note, we reify the connection between work on the storage capacity problem in wide two-layer treelike neural networks and the rapidly-growing body of literature on kernel limits of wide neural networks. Concretely, we observe that the "effective order parameter" studied in the statistical mechanics literature is exactly equivalent to the infinite-width Neural Network Gaussian Process Kernel. This correspondence connects the expressivity and trainability of wide two-layer neural networks.
Submitted 12 January, 2022;
originally announced January 2022.
-
Asymptotics of representation learning in finite Bayesian neural networks
Authors:
Jacob A. Zavatone-Veth,
Abdulkadir Canatar,
Benjamin S. Ruben,
Cengiz Pehlevan
Abstract:
Recent works have suggested that finite Bayesian neural networks may sometimes outperform their infinite cousins because finite networks can flexibly adapt their internal representations. However, our theoretical understanding of how the learned hidden layer representations of finite networks differ from the fixed representations of infinite networks remains incomplete. Perturbative finite-width corrections to the network prior and posterior have been studied, but the asymptotics of learned features have not been fully characterized. Here, we argue that the leading finite-width corrections to the average feature kernels for any Bayesian network with linear readout and Gaussian likelihood have a largely universal form. We illustrate this explicitly for three tractable network architectures: deep linear fully-connected and convolutional networks, and networks with a single nonlinear hidden layer. Our results begin to elucidate how task-relevant learning signals shape the hidden layer representations of wide Bayesian neural networks.
Submitted 8 February, 2022; v1 submitted 1 June, 2021;
originally announced June 2021.
-
Exact marginal prior distributions of finite Bayesian neural networks
Authors:
Jacob A. Zavatone-Veth,
Cengiz Pehlevan
Abstract:
Bayesian neural networks are theoretically well-understood only in the infinite-width limit, where Gaussian priors over network weights yield Gaussian priors over network outputs. Recent work has suggested that finite Bayesian networks may outperform their infinite counterparts, but their non-Gaussian function space priors have been characterized only through perturbative approaches. Here, we derive exact solutions for the function space priors for individual input examples of a class of finite fully-connected feedforward Bayesian neural networks. For deep linear networks, the prior has a simple expression in terms of the Meijer $G$-function. The prior of a finite ReLU network is a mixture of the priors of linear networks of smaller widths, corresponding to different numbers of active units in each layer. Our results unify previous descriptions of finite network priors in terms of their tail decay and large-width behavior.
Submitted 18 October, 2021; v1 submitted 23 April, 2021;
originally announced April 2021.
-
Activation function dependence of the storage capacity of treelike neural networks
Authors:
Jacob A. Zavatone-Veth,
Cengiz Pehlevan
Abstract:
The expressive power of artificial neural networks crucially depends on the nonlinearity of their activation functions. Though a wide variety of nonlinear activation functions have been proposed for use in artificial neural networks, a detailed understanding of their role in determining the expressive power of a network has not emerged. Here, we study how activation functions affect the storage capacity of treelike two-layer networks. We relate the boundedness or divergence of the capacity in the infinite-width limit to the smoothness of the activation function, elucidating the relationship between previously studied special cases. Our results show that nonlinearity can both increase capacity and decrease the robustness of classification, and provide simple estimates for the capacity of networks with several commonly used activation functions. Furthermore, they generate a hypothesis for the functional benefit of dendritic spikes in branched neurons.
Submitted 4 February, 2021; v1 submitted 21 July, 2020;
originally announced July 2020.