-
The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent
Authors:
Yatin Dandi,
Luca Pesce,
Lenka Zdeborová,
Florent Krzakala
Abstract:
Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD successively reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically fewer samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms.
Submitted 11 June, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
Germanium target sensed by phonon-mediated kinetic inductance detectors
Authors:
D. Delicato,
D. Angelone,
L. Bandiera,
M. Calvo,
M. Cappelli,
U. Chowdhury,
G. Del Castello,
M. Folcarelli,
M. del Gallo Roccagiovine,
V. Guidi,
G. L. Pesce,
M. Romagnoni,
A. Cruciani,
A. Mazzolari,
A. Monfardini,
M. Vignati
Abstract:
Cryogenic phonon detectors are adopted in experiments searching for dark matter interactions or coherent elastic neutrino-nucleus scattering, thanks to the low energy threshold they can achieve. The phonon-mediated sensing of particle interactions in passive silicon absorbers has been demonstrated with Kinetic Inductance Detectors (KIDs). Targets with neutron number larger than silicon, however, feature a higher cross section to neutrinos, while multi-target absorbers in dark matter experiments would provide stronger evidence of a possible signal. In this work we present the design, fabrication and operation of KIDs coupled to a germanium absorber, achieving phonon-sensing performance comparable to silicon absorbers. The device introduced in this work is a proof of concept for a scalable neutrino detector and for a multi-target dark matter experiment.
Submitted 15 April, 2025; v1 submitted 10 December, 2024;
originally announced December 2024.
-
A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities
Authors:
Yatin Dandi,
Luca Pesce,
Hugo Cui,
Florent Krzakala,
Yue M. Lu,
Bruno Loureiro
Abstract:
A key property of neural networks is their capacity to adapt to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remains limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of certain low-dimensional operators. This allows us to sharply characterize the impact of training on the asymptotic feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum change with training. The deterministic equivalent further yields the exact asymptotic generalization error, shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest. Unlike previous work, our result holds in the challenging maximal learning rate regime, is fully rigorous and allows for finitely supported second layer initialization, which turns out to be crucial for studying the functional expressivity of the learned features. This provides a sharp description of the impact of feature learning on the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
Submitted 24 October, 2024;
originally announced October 2024.
-
Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs
Authors:
Luca Arnaboldi,
Yatin Dandi,
Florent Krzakala,
Bruno Loureiro,
Luca Pesce,
Ludovic Stephan
Abstract:
We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned (Ben Arous et al., 2021) and $d$ is the input dimension. However, batch sizes larger than this, $n_b \gg d^{\frac{\ell}{2}}$, are detrimental to the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.
Submitted 4 June, 2024;
originally announced June 2024.
-
Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions
Authors:
Luca Arnaboldi,
Yatin Dandi,
Florent Krzakala,
Luca Pesce,
Ludovic Stephan
Abstract:
Neural networks can identify low-dimensional relevant structures within high-dimensional noisy data, yet our mathematical understanding of how they do so remains scarce. Here, we investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms, and discuss how they learn pertinent features in multi-index models, that is, target functions with low-dimensional relevant directions. In the high-dimensional regime, where the input dimension $d$ diverges, we show that a simple modification of the idealized single-pass gradient descent training scenario, where data can now be repeated or iterated upon twice, drastically improves its computational efficiency. In particular, it surpasses the limitations previously believed to be dictated by the Information and Leap exponents associated with the target function to be learned. Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing. More precisely, we show that (almost) all directions are learned with at most $O(d \log d)$ steps. Among the exceptions is a set of hard functions that includes sparse parities. In the presence of coupling between directions, however, these can be learned sequentially through a hierarchical mechanism that generalizes the notion of staircase functions. Our results are proven by a rigorous study of the evolution of the relevant statistics for high-dimensional dynamics.
Submitted 10 February, 2025; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Asymptotics of feature learning in two-layer networks after one gradient-step
Authors:
Hugo Cui,
Luca Pesce,
Yatin Dandi,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová,
Bruno Loureiro
Abstract:
In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al., 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit where the number of samples, the width, and the input dimension grow at a proportional rate. The resulting characterization for sRFs also captures closely the learning curves of the original network model. This enables us to understand how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient -- where at initialization it can only express linear functions in this regime.
Submitted 4 June, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents
Authors:
Yatin Dandi,
Emanuele Troiani,
Luca Arnaboldi,
Luca Pesce,
Lenka Zdeborová,
Florent Krzakala
Abstract:
We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.
Submitted 30 June, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Theory and applications of the Sum-Of-Squares technique
Authors:
Francis Bach,
Elisabetta Cornacchia,
Luca Pesce,
Giovanni Piccioli
Abstract:
The Sum-of-Squares (SOS) approximation method is a technique used in optimization problems to derive lower bounds on the optimal value of an objective function. By representing the objective function as a sum of squares in a feature space, the SOS method transforms non-convex global optimization problems into solvable semidefinite programs. This note presents an overview of the SOS method. We start with its application in finite-dimensional feature spaces and, subsequently, we extend it to infinite-dimensional feature spaces using reproducing kernels (k-SOS). Additionally, we highlight the utilization of SOS for estimating some relevant quantities in information theory, including the log-partition function.
Submitted 11 March, 2024; v1 submitted 28 June, 2023;
originally announced June 2023.
-
How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
Authors:
Yatin Dandi,
Florent Krzakala,
Bruno Loureiro,
Luca Pesce,
Ludovic Stephan
Abstract:
For high-dimensional Gaussian data, we investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to an improvement in the approximation capacity from initialization. First, we compare the influence of batch size to that of multiple steps. For a single step, a batch of size $n = \mathcal{O}(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = \mathcal{O}(d^2)$ is essential for neurons to specialize in multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist "hard" directions requiring $n = \mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. Second, we show that the picture drastically improves over multiple gradient steps: a batch size of $n = \mathcal{O}(d)$ is indeed sufficient to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allow for a drastic improvement in the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence, which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the intertwined role of the structure of the task to learn, the details of the algorithm, and the architecture, shedding new light on how neural networks adapt to the feature and learn complex tasks from data over time.
Submitted 3 June, 2025; v1 submitted 29 May, 2023;
originally announced May 2023.
-
Are Gaussian data all you need? Extents and limits of universality in high-dimensional generalized linear estimation
Authors:
Luca Pesce,
Florent Krzakala,
Bruno Loureiro,
Ludovic Stephan
Abstract:
In this manuscript we consider the problem of generalized linear estimation on Gaussian mixture data with labels given by a single-index model. Our first result is a sharp asymptotic expression for the test and training errors in the high-dimensional regime. Motivated by the recent stream of results on the Gaussian universality of the test and training errors in generalized linear estimation, we ask ourselves the question: "when is a single Gaussian enough to characterize the error?". Our formulas allow us to give sharp answers to this question, both in the positive and negative directions. More precisely, we show that the sufficient conditions for Gaussian universality (or lack thereof) crucially depend on the alignment between the target weights and the means and covariances of the mixture clusters, which we precisely quantify. In the particular case of least-squares interpolation, we prove a strong universality property of the training error, and show it follows a simple, closed-form expression. Finally, we apply our results to real datasets, clarifying some recent discussion in the literature about Gaussian universality of the errors in this context.
Submitted 17 February, 2023;
originally announced February 2023.
-
Innate Dynamics and Identity Crisis of a Metal Surface Unveiled by Machine Learning of Atomic Environments
Authors:
Matteo Cioni,
Daniela Polino,
Daniele Rapetti,
Luca Pesce,
Massimo Delle Piane,
Giovanni M. Pavan
Abstract:
Metals are traditionally considered hard matter. However, it is well known that their atomic lattices may become dynamic and undergo reconfigurations even well below the melting temperature. The innate atomic dynamics of metals is directly related to their bulk and surface properties. Understanding their complex structural dynamics is thus important for many applications but is not easy. Here we report deep-potential molecular dynamics simulations that resolve at atomic resolution the complex dynamics of various types of copper (Cu) surfaces, used as an example, near the Hüttig ($\sim1/3$ of melting) temperature. The development of a deep neural network potential trained on DFT calculations provides a dynamically accurate force field that we use to simulate large atomistic models of different Cu surface types. A combination of high-dimensional structural descriptors and unsupervised machine learning allows us to identify and track all the atomic environments (AEs) emerging in the surfaces at finite temperatures. We can directly observe how AEs that are non-native in a specific (ideal) surface, but that are instead typical of other surface types, continuously emerge/disappear in that surface in relevant regimes in dynamic equilibrium with the native ones. Our analyses allow us to estimate the lifetime of all the AEs populating these Cu surfaces and to reconstruct their dynamic interconversion networks. This reveals the elusive identity of these metal surfaces, which preserve their identity only in part and in part transform into something else in relevant conditions. This also suggests a concept of "statistical identity" for metal surfaces, which is key for understanding their behaviors and properties.
Submitted 21 February, 2023; v1 submitted 29 July, 2022;
originally announced July 2022.
-
Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap
Authors:
Luca Pesce,
Bruno Loureiro,
Florent Krzakala,
Lenka Zdeborová
Abstract:
A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. We identify in particular the existence of a statistical-to-computational gap between the algorithm, which requires a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log \rho} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.
Submitted 1 December, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.