-
Computational Thresholds in Multi-Modal Learning via the Spiked Matrix-Tensor Model
Authors:
Hugo Tabanelli,
Pierre Mergny,
Lenka Zdeborová,
Florent Krzakala
Abstract:
We study the recovery of multiple high-dimensional signals from two noisy, correlated modalities: a spiked matrix and a spiked tensor sharing a common low-rank structure. This setting generalizes classical spiked matrix and tensor models, unveiling intricate interactions between inference channels and surprising algorithmic behaviors. Notably, while the spiked tensor model is typically intractable at low signal-to-noise ratios, its correlation with the matrix enables efficient recovery via Bayesian Approximate Message Passing, inducing staircase-like phase transitions reminiscent of neural network phenomena. In contrast, empirical risk minimization for joint learning fails: the tensor component obstructs effective matrix recovery, and joint optimization significantly degrades performance, highlighting the limitations of naive multi-modal learning. We show that a simple Sequential Curriculum Learning strategy (first recovering the matrix, then leveraging it to guide tensor recovery) resolves this bottleneck and achieves optimal weak recovery thresholds. This strategy, implementable with spectral methods, emphasizes the critical role of structural correlation and learning order in multi-modal high-dimensional inference.
Submitted 3 June, 2025;
originally announced June 2025.
-
Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks
Authors:
Luca Arnaboldi,
Bruno Loureiro,
Ludovic Stephan,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases: escape from uninformative initialization and alignment with the target subspace, and demonstrates how the sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models.
Submitted 3 June, 2025;
originally announced June 2025.
-
Learning with Restricted Boltzmann Machines: Asymptotics of AMP and GD in High Dimensions
Authors:
Yizhou Xu,
Florent Krzakala,
Lenka Zdeborová
Abstract:
The Restricted Boltzmann Machine (RBM) is one of the simplest generative neural networks capable of learning input distributions. Despite its simplicity, the analysis of its performance in learning from the training data is only well understood in cases that essentially reduce to singular value decomposition of the data. Here, we consider the limit of a large dimension of the input space and a constant number of hidden units. In this limit, we simplify the standard RBM training objective into a form that is equivalent to the multi-index model with non-separable regularization. This opens a path to analyze training of the RBM using methods that are established for multi-index models, such as Approximate Message Passing (AMP) and its state evolution, and the analysis of Gradient Descent (GD) via the dynamical mean-field theory. We then give rigorous asymptotics of the training dynamics of RBM on data generated by the spiked covariance model as a prototype of a structure suitable for unsupervised learning. We show in particular that RBM reaches the optimal computational weak recovery threshold, aligning with the BBP transition, in the spiked covariance model.
Submitted 23 May, 2025;
originally announced May 2025.
-
The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks
Authors:
Vittorio Erba,
Emanuele Troiani,
Lenka Zdeborová,
Florent Krzakala
Abstract:
We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the $\ell_2$-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.
Submitted 23 May, 2025;
originally announced May 2025.
-
Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications
Authors:
Yizhou Xu,
Antoine Maillard,
Lenka Zdeborová,
Florent Krzakala
Abstract:
In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g. a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: $(i)$ we prove universality properties to handle structured sensing matrices, related to the "Gaussian equivalence" phenomenon in statistical learning, $(ii)$ we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and $(iii)$ we leverage previous works on the problem of matrix denoising. The generality of our results allows for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in [ETB+24] regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in [MTM+24] on Bayes-optimal learning in neural networks with quadratic activation function, and width proportional to the dimension.
Submitted 18 March, 2025;
originally announced March 2025.
-
Optimal Spectral Transitions in High-Dimensional Multi-Index Models
Authors:
Leonardo Defilippis,
Yatin Dandi,
Pierre Mergny,
Florent Krzakala,
Bruno Loureiro
Abstract:
We consider the problem of how many samples from a Gaussian multi-index model are required to weakly reconstruct the relevant index subspace. Despite its increasing popularity as a testbed for investigating the computational complexity of neural networks, results beyond the single-index setting remain elusive. In this work, we introduce spectral algorithms based on the linearization of a message passing scheme tailored to this problem. Our main contribution is to show that the proposed methods achieve the optimal reconstruction threshold. Leveraging a high-dimensional characterization of the algorithms, we show that above the critical threshold the leading eigenvector correlates with the relevant index subspace, a phenomenon reminiscent of the Baik-Ben Arous-Péché (BBP) transition in spiked models arising in random matrix theory. Supported by numerical experiments and a rigorous theoretical framework, our work bridges critical gaps in the computational limits of weak learnability in multi-index models.
Submitted 10 June, 2025; v1 submitted 4 February, 2025;
originally announced February 2025.
-
Fundamental limits of learning in sequence multi-index models and deep attention networks: High-dimensional asymptotics and sharp thresholds
Authors:
Emanuele Troiani,
Hugo Cui,
Yatin Dandi,
Florent Krzakala,
Lenka Zdeborová
Abstract:
In this manuscript, we study the learning of deep attention neural networks, defined as the composition of multiple self-attention layers, with tied and low-rank weights. We first establish a mapping of such models to sequence multi-index models, a generalization of the widely studied multi-index model to sequential covariates, for which we establish a number of general results. In the context of Bayesian-optimal learning, in the limit of large dimension $D$ and commensurably large number of samples $N$, we derive a sharp asymptotic characterization of the optimal performance as well as the performance of the best-known polynomial-time algorithm for this setting, namely approximate message-passing, and characterize sharp thresholds on the minimal sample complexity required for better-than-random prediction performance. Our analysis uncovers, in particular, how the different layers are learned sequentially. Finally, we discuss how this sequential learning can also be observed in a realistic setup.
Submitted 2 February, 2025;
originally announced February 2025.
-
Streamlined optical training of large-scale modern deep learning architectures with direct feedback alignment
Authors:
Ziao Wang,
Kilian Müller,
Matthew Filipovich,
Julien Launay,
Ruben Ohana,
Gustave Pariente,
Safa Mokaadi,
Charles Brossollet,
Fabien Moreau,
Alessandro Cappelli,
Iacopo Poli,
Igor Carron,
Laurent Daudet,
Florent Krzakala,
Sylvain Gigan
Abstract:
Modern deep learning relies nearly exclusively on dedicated electronic hardware accelerators. Photonic approaches, with low consumption and high operation speed, are increasingly considered for inference but, to date, remain mostly limited to relatively basic tasks. Simultaneously, the problem of training deep and complex neural networks, overwhelmingly performed through backpropagation, remains a significant limitation to the size and, consequently, the performance of current architectures and a major compute and energy bottleneck. Here, we experimentally implement a versatile and scalable training algorithm, called direct feedback alignment, on a hybrid electronic-photonic platform. An optical processing unit performs large-scale random matrix multiplications, which is the central operation of this algorithm, at speeds up to 1500 TeraOPS under 30 Watts of power. We perform optical training of modern deep learning architectures, including Transformers, with more than 1B parameters, and obtain good performances on language, vision, and diffusion-based generative tasks. We study the scaling of the training time, and demonstrate a potential advantage of our hybrid opto-electronic approach for ultra-deep and wide neural networks, thus opening a promising route to sustain the exponential growth of modern artificial intelligence beyond traditional von Neumann approaches.
Submitted 2 April, 2025; v1 submitted 1 September, 2024;
originally announced September 2024.
-
The phase diagram of compressed sensing with $\ell_0$-norm regularization
Authors:
Damien Barbier,
Carlo Lucibello,
Luca Saglietti,
Florent Krzakala,
Lenka Zdeborová
Abstract:
Noiseless compressive sensing is a two-step setting that allows for undersampling a sparse signal and then reconstructing it without loss of information. The LASSO algorithm, based on $\ell_1$ regularization, provides an efficient and robust way to address this problem, but it fails in the regime of very high compression rate. Here we present two algorithms based on $\ell_0$-norm regularization instead that outperform the LASSO in terms of compression rate in the Gaussian design setting for the measurement matrix. These algorithms are based on Approximate Survey Propagation, an algorithmic family within the Approximate Message Passing class. In the large system limit, they can be rigorously tracked through State Evolution equations and it is possible to exactly predict the range of compression rates for which perfect signal reconstruction is possible. We also provide a statistical physics analysis of the $\ell_0$-norm noiseless compressive sensing model. We show the existence of both a replica symmetric state and a 1-step replica symmetry broken (1RSB) state for sufficiently low $\ell_0$-norm regularization. The recovery limits of our algorithms are linked to the behavior of the 1RSB solution.
Submitted 22 August, 2024; v1 submitted 31 July, 2024;
originally announced August 2024.
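For orientation, the LASSO baseline that the abstract above compares against can be sketched in a few lines. This is only an illustrative sketch of $\ell_1$-regularized recovery using proximal gradient descent (ISTA), not the Approximate Survey Propagation algorithms of the paper; the function names, problem sizes, regularization strength and iteration count are all assumptions chosen for the demonstration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (entrywise shrinkage towards zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=3000):
    """Minimise 0.5 * ||A x - y||^2 + lam * ||x||_1 by proximal gradient steps."""
    # Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x

def demo(n=100, m=80, k=5, lam=0.01, seed=0):
    """Noiseless Gaussian-design recovery of a k-sparse signal (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    x_true = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x_true[support] = rng.choice([-1.0, 1.0], size=k)
    A = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian measurement matrix
    y = A @ x_true                                  # noiseless measurements
    x_hat = lasso_ista(A, y, lam)
    return np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

if __name__ == "__main__":
    print("relative reconstruction error:", demo())
```

Lowering `m` at fixed `k` in this sketch is one way to probe the high-compression regime in which, according to the abstract, the LASSO stops succeeding.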
-
Bayes-optimal learning of an extensive-width neural network from quadratically many samples
Authors:
Antoine Maillard,
Emanuele Troiani,
Simon Martin,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We consider the problem of learning a target function corresponding to a single hidden layer neural network, with a quadratic activation function after the first layer, and random weights. We consider the asymptotic limit where the input dimension and the network width are proportionally large. Recent work [Cui & al '23] established that linear regression provides Bayes-optimal test error to learn such a function when the number of available samples is only linear in the dimension. That work stressed the open challenge of theoretically analyzing the optimal test error in the more interesting regime where the number of samples is quadratic in the dimension. In this paper, we solve this challenge for quadratic activations and derive a closed-form expression for the Bayes-optimal test error. We also provide an algorithm, that we call GAMP-RIE, which combines approximate message passing with rotationally invariant matrix denoising, and that asymptotically achieves the optimal performance. Technically, our result is enabled by establishing a link with recent works on optimal denoising of extensive-rank matrices and on the ellipsoid fitting problem. We further show empirically that, in the absence of noise, randomly-initialized gradient descent seems to sample the space of weights, leading to zero training loss, and averaging over initialization leads to a test error equal to the Bayes-optimal one.
Submitted 7 August, 2024;
originally announced August 2024.
-
Fundamental computational limits of weak learnability in high-dimensional multi-index models
Authors:
Emanuele Troiani,
Yatin Dandi,
Leonardo Defilippis,
Lenka Zdeborová,
Bruno Loureiro,
Florent Krzakala
Abstract:
Multi-index models (functions which only depend on the covariates through a non-linear transformation of their projection on a subspace) are a useful benchmark for investigating feature learning with neural nets. This paper examines the theoretical boundaries of efficient learnability in this hypothesis class, focusing on the minimum sample complexity required for weakly recovering their low-dimensional structure with first-order iterative algorithms, in the high-dimensional regime where the number of samples $n\!=\!αd$ is proportional to the covariate dimension $d$. Our findings unfold in three parts: (i) we identify under which conditions a trivial subspace can be learned with a single step of a first-order algorithm for any $α\!>\!0$; (ii) if the trivial subspace is empty, we provide necessary and sufficient conditions for the existence of an easy subspace consisting of directions that can be learned only above a certain sample complexity $α\!>\!α_c$, where $α_{c}$ marks a computational phase transition. In a limited but interesting set of really hard directions -- akin to the parity problem -- $α_c$ is found to diverge. Finally, (iii) we show that interactions between different directions can result in an intricate hierarchical learning phenomenon, where directions can be learned sequentially when coupled to easier ones. We discuss in detail the grand staircase picture associated to these functions (and contrast it with the original staircase one). Our theory builds on the optimality of approximate message-passing among first-order iterative methods, delineating the fundamental learnability limit across a broad spectrum of algorithms, including neural networks trained with gradient descent, which we discuss in this context.
Submitted 2 April, 2025; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Quenches in the Sherrington-Kirkpatrick model
Authors:
Vittorio Erba,
Freya Behrens,
Florent Krzakala,
Lenka Zdeborová
Abstract:
The Sherrington-Kirkpatrick (SK) model is a prototype of a complex non-convex energy landscape. Dynamical processes evolving on such landscapes and locally aiming to reach minima are generally poorly understood. Here, we study quenches, i.e. dynamics that locally aim to decrease energy. We analyse the energy at convergence for two distinct algorithmic classes, single-spin flip and synchronous dynamics, focusing on greedy and reluctant strategies. We provide precise numerical analysis of the finite size effects and conclude that, perhaps counter-intuitively, the reluctant algorithm is compatible with converging to the ground state energy density, while the greedy strategy is not. Inspired by the single-spin reluctant and greedy algorithms, we investigate two synchronous time algorithms, the sync-greedy and sync-reluctant algorithms. These synchronous processes can be analysed using dynamical mean field theory (DMFT), and a new backtracking version of DMFT. Notably, this is the first time the backtracking DMFT is applied to study dynamical convergence properties in fully connected disordered models. The analysis suggests that the sync-greedy algorithm can also achieve energies compatible with the ground state, and that it undergoes a dynamical phase transition.
Submitted 17 July, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
Spectral Phase Transition and Optimal PCA in Block-Structured Spiked models
Authors:
Pierre Mergny,
Justin Ko,
Florent Krzakala
Abstract:
We discuss the inhomogeneous spiked Wigner model, a theoretical framework recently introduced to study structured noise in various learning scenarios, through the prism of random matrix theory, with a specific focus on its spectral properties. Our primary objective is to find an optimal spectral method and to extend the celebrated Baik-Ben Arous-Péché (BBP) phase transition criterion -- well-known in the homogeneous case -- to our inhomogeneous, block-structured, Wigner model. We provide a thorough rigorous analysis of a transformed matrix and show that the transition for the appearance of 1) an outlier outside the bulk of the limiting spectral distribution and 2) a positive overlap between the associated eigenvector and the signal, occurs precisely at the optimal threshold, making the proposed spectral method optimal within the class of iterative methods for the inhomogeneous Wigner problem.
Submitted 6 March, 2024;
originally announced March 2024.
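The BBP transition referred to in the abstract above can be observed numerically in the homogeneous case. The sketch below is only an illustration of that classical setting (a rank-one spike plus Wigner noise), not the block-structured model or the transformed matrix studied in the paper; the function names, the Rademacher signal and the matrix size are assumptions. With this normalization the bulk edge sits near 2, and the standard prediction is that for signal strength `snr` above 1 the top eigenvalue approaches `snr + 1/snr` while the squared overlap approaches `1 - 1/snr**2`.

```python
import numpy as np

def spiked_wigner(n, snr, rng):
    """Sample Y = (snr / n) * x x^T + W / sqrt(n), with W a GOE-like noise matrix."""
    x = rng.choice([-1.0, 1.0], size=n)        # Rademacher signal, ||x||^2 = n
    g = rng.standard_normal((n, n))
    w = (g + g.T) / np.sqrt(2.0)               # symmetric, off-diagonal variance 1
    y = (snr / n) * np.outer(x, x) + w / np.sqrt(n)
    return x, y

def top_eigen_overlap(n, snr, seed=0):
    """Return the largest eigenvalue and the squared overlap of its eigenvector with x."""
    rng = np.random.default_rng(seed)
    x, y = spiked_wigner(n, snr, rng)
    vals, vecs = np.linalg.eigh(y)             # eigenvalues in ascending order
    v = vecs[:, -1]                            # unit-norm leading eigenvector
    overlap_sq = (v @ x) ** 2 / n              # squared cosine between v and x
    return vals[-1], overlap_sq

if __name__ == "__main__":
    for snr in (0.5, 1.5, 2.0):
        lam_max, q = top_eigen_overlap(400, snr)
        print(f"snr={snr}: top eigenvalue={lam_max:.3f}, squared overlap={q:.3f}")
```

Below the threshold the leading eigenvector carries essentially no information about the signal, which is why the overlap stays close to zero there at finite size.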
-
Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression
Authors:
Lucas Clarté,
Adrien Vandenbroucque,
Guillaume Dalle,
Bruno Loureiro,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and dimension $d$ of the covariates grow at a comparable fixed rate $α\!=\! n/d$. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $α$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $α\!<\!1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization.
Submitted 1 November, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
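As a concrete picture of the resampling procedure analysed in the abstract above, the following sketch estimates the variance of a ridge estimator by the pairs bootstrap. It is a minimal illustration under assumed choices (Gaussian covariates, a linear teacher, the function names, and the averaging of per-coordinate variances); it does not reproduce the asymptotic formulas of the paper.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def bootstrap_ridge_variance(X, y, lam, n_boot, rng):
    """Bootstrap estimate of the estimator variance, averaged over coordinates."""
    n, d = X.shape
    ws = np.empty((n_boot, d))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample n pairs with replacement
        ws[b] = ridge_fit(X[idx], y[idx], lam)
    return ws.var(axis=0).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 50
    w_star = rng.standard_normal(d) / np.sqrt(d)   # hypothetical teacher vector
    for alpha in (0.5, 2.0, 8.0):                  # alpha = n / d, as in the abstract
        n = int(alpha * d)
        X = rng.standard_normal((n, d))
        y = X @ w_star + 0.5 * rng.standard_normal(n)
        v = bootstrap_ridge_variance(X, y, lam=1.0, n_boot=200, rng=rng)
        print(f"alpha={alpha}: bootstrap variance estimate = {v:.4f}")
```

Comparing these estimates with the variance measured over freshly drawn datasets is one way to see, at finite size, the discrepancy at small $α$ that the abstract describes.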
-
A High Dimensional Statistical Model for Adversarial Training: Geometry and Trade-Offs
Authors:
Kasimir Tanner,
Matteo Vilucchio,
Bruno Loureiro,
Florent Krzakala
Abstract:
This work investigates adversarial training in the context of margin-based linear classifiers in the high-dimensional regime where the dimension $d$ and the number of data points $n$ diverge with a fixed ratio $α = n / d$. We introduce a tractable mathematical model where the interplay between the data and adversarial attacker geometries can be studied, while capturing the core phenomenology observed in the adversarial robustness literature. Our main theoretical contribution is an exact asymptotic description of the sufficient statistics for the adversarial empirical risk minimiser, under generic convex and non-increasing losses for a Block Feature Model. Our results allow us to precisely characterise which directions in the data are associated with a higher generalisation/robustness trade-off, as defined by a robustness and a usefulness metric. We show that the presence of multiple different feature types is crucial to the high sample complexity performances of adversarial training. In particular, we unveil the existence of directions which can be defended without penalising accuracy. Finally, we show the advantage of defending non-robust features during training, identifying a uniform protection as an inherently effective defence mechanism.
Submitted 27 December, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Asymptotics of feature learning in two-layer networks after one gradient-step
Authors:
Hugo Cui,
Luca Pesce,
Yatin Dandi,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová,
Bruno Loureiro
Abstract:
In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al., 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit where the number of samples, the width, and the input dimension grow at a proportional rate. The resulting characterization for sRFs also captures closely the learning curves of the original network model. This enables us to understand how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient -- where at initialization it can only express linear functions in this regime.
Submitted 4 June, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
On the Atypical Solutions of the Symmetric Binary Perceptron
Authors:
Damien Barbier,
Ahmed El Alaoui,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We study the random binary symmetric perceptron problem, focusing on the behavior of rare high-margin solutions. While most solutions are isolated, we demonstrate that these rare solutions are part of clusters of extensive entropy, heuristically corresponding to non-trivial fixed points of an approximate message-passing algorithm. We enumerate these clusters via a local entropy, defined as a Franz-Parisi potential, which we rigorously evaluate using the first and second moment methods in the limit of a small constraint density $α$ (corresponding to vanishing margin $κ$) under a certain assumption on the concentration of the entropy. This examination unveils several intriguing phenomena: i) We demonstrate that these clusters have an entropic barrier in the sense that the entropy as a function of the distance from the reference high-margin solution is non-monotone when $κ\le 1.429 \sqrt{-α/\logα}$, while it is monotone otherwise, and that they have an energetic barrier in the sense that there are no solutions at an intermediate distance from the reference solution when $κ\le 1.239 \sqrt{-α/ \logα}$. The critical scaling of the margin $κ$ in $\sqrt{-α/\logα}$ corresponds to the one obtained from the earlier work of Gamarnik et al. (2022) for the overlap-gap property, a phenomenon known to present a barrier to certain efficient algorithms. ii) We establish using the replica method that the complexity (the logarithm of the number of clusters of such solutions) versus entropy (the logarithm of the number of solutions in the clusters) curves are partly non-concave and correspond to very large values of the Parisi parameter, with the equilibrium being reached when the Parisi parameter diverges.
Submitted 28 June, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective
Authors:
Davide Ghio,
Yatin Dandi,
Florent Krzakala,
Lenka Zdeborová
Abstract:
Recent years witnessed the development of powerful generative models based on flows, diffusion or autoregressive neural networks, achieving remarkable success in generating data from examples with applications in a broad range of areas. A theoretical analysis of the performance and understanding of the limitations of these methods remain, however, challenging. In this paper, we undertake a step in this direction by analysing the efficiency of sampling by these methods on a class of problems with a known probability distribution and comparing it with the sampling performance of more traditional methods such as the Monte Carlo Markov chain and Langevin dynamics. We focus on a class of probability distributions widely studied in the statistical physics of disordered systems that relate to spin glasses, statistical inference and constraint satisfaction problems.
We leverage the fact that sampling via flow-based, diffusion-based or autoregressive network methods can be equivalently mapped to the analysis of a Bayes optimal denoising of a modified probability measure. Our findings demonstrate that these methods encounter difficulties in sampling stemming from the presence of a first-order phase transition along the algorithm's denoising path. Our conclusions go both ways: we identify regions of parameters where these methods are unable to sample efficiently, while that is possible using standard Monte Carlo or Langevin approaches. We also identify regions where the opposite happens: standard approaches are inefficient while the discussed generative methods work well.
Submitted 27 August, 2023;
originally announced August 2023.
-
Compressed sensing with l0-norm: statistical physics analysis and algorithms for signal recovery
Authors:
D. Barbier,
C. Lucibello,
L. Saglietti,
F. Krzakala,
L. Zdeborova
Abstract:
Noiseless compressive sensing is a protocol that enables undersampling and later recovery of a signal without loss of information. This compression is possible because the signal is usually sufficiently sparse in a given basis. Currently, the algorithm offering the best tradeoff between compression rate, robustness, and speed for compressive sensing is the LASSO (l1-norm bias) algorithm. However, many studies have pointed out the possibility that the implementation of lp-norm biases, with p smaller than one, could give better performance while sacrificing convexity. In this work, we focus specifically on the extreme case of the l0-based reconstruction, a task that is complicated by the discontinuity of the loss. In the first part of the paper, we describe via statistical physics methods, and in particular the replica method, how the solutions to this optimization problem are arranged in a clustered structure. We observe two distinct regimes: one at low compression rate where the signal can be recovered exactly, and one at high compression rate where the signal cannot be recovered accurately. In the second part, we present two message-passing algorithms based on our first results for the l0-norm optimization problem. The proposed algorithms are able to recover the signal at compression rates higher than the ones achieved by LASSO while being computationally efficient.
Submitted 24 April, 2023;
originally announced April 2023.
-
Statistical mechanics of the maximum-average submatrix problem
Authors:
Vittorio Erba,
Florent Krzakala,
Rodrigo Pérez,
Lenka Zdeborová
Abstract:
We study the maximum-average submatrix problem, in which given an $N \times N$ matrix $J$ one needs to find the $k \times k$ submatrix with the largest average of entries. We study the problem for random matrices $J$ whose entries are i.i.d. random variables by mapping it to a variant of the Sherrington-Kirkpatrick spin-glass model at fixed magnetization. We characterize analytically the phase diagram of the model as a function of the submatrix average and the size of the submatrix $k$ in the limit $N\to\infty$. We consider submatrices of size $k = m N$ with $0 < m < 1$. We find a rich phase diagram, including dynamical, static one-step replica symmetry breaking and full-step replica symmetry breaking. In the limit of $m \to 0$, we find a simpler phase diagram featuring a frozen 1-RSB phase, where the Gibbs measure is composed of exponentially many pure states each with zero entropy. We discover an interesting phenomenon, reminiscent of the phenomenology of the binary perceptron: there exist efficient algorithms that provably work in the frozen 1-RSB phase.
Submitted 21 September, 2023; v1 submitted 9 March, 2023;
originally announced March 2023.
-
Are Gaussian data all you need? Extents and limits of universality in high-dimensional generalized linear estimation
Authors:
Luca Pesce,
Florent Krzakala,
Bruno Loureiro,
Ludovic Stephan
Abstract:
In this manuscript we consider the problem of generalized linear estimation on Gaussian mixture data with labels given by a single-index model. Our first result is a sharp asymptotic expression for the test and training errors in the high-dimensional regime. Motivated by the recent stream of results on the Gaussian universality of the test and training errors in generalized linear estimation, we ask ourselves the question: "when is a single Gaussian enough to characterize the error?". Our formula allows us to give sharp answers to this question, both in the positive and negative directions. More precisely, we show that the sufficient conditions for Gaussian universality (or lack thereof) crucially depend on the alignment between the target weights and the means and covariances of the mixture clusters, which we precisely quantify. In the particular case of least-squares interpolation, we prove a strong universality property of the training error, and show it follows a simple, closed-form expression. Finally, we apply our results to real datasets, clarifying some recent discussion in the literature about Gaussian universality of the errors in this context.
Submitted 17 February, 2023;
originally announced February 2023.
-
Optimal Algorithms for the Inhomogeneous Spiked Wigner Model
Authors:
Aleksandr Pak,
Justin Ko,
Florent Krzakala
Abstract:
In this paper, we study a spiked Wigner problem with an inhomogeneous noise profile. Our aim in this problem is to recover the signal passed through an inhomogeneous low-rank matrix channel. While the information-theoretic performances are well-known, we focus on the algorithmic problem. We derive an approximate message-passing algorithm (AMP) for the inhomogeneous problem and show that its rigorous state evolution coincides with the information-theoretic optimal Bayes fixed-point equations. We identify in particular the existence of a statistical-to-computational gap where known algorithms require a signal-to-noise ratio bigger than the information-theoretic threshold to perform better than random. Finally, from the adapted AMP iteration we deduce a simple and efficient spectral method that can be used to recover the transition for matrices with general variance profiles. This spectral method matches the conjectured optimal computational phase transition.
Submitted 13 February, 2023;
originally announced February 2023.
-
From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks
Authors:
Luca Arnaboldi,
Ludovic Stephan,
Florent Krzakala,
Bruno Loureiro
Abstract:
This manuscript investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network trained on Gaussian data and labels generated by a similar, though not necessarily identical, target function. We rigorously analyse the limiting dynamics via a deterministic and low-dimensional description in terms of the sufficient statistics for the population risk. Our unifying analysis bridges different regimes of interest, such as the classical gradient-flow regime of vanishing learning rate, the high-dimensional regime of large input dimension, and the overparameterised "mean-field" regime of large network width, covering as well the intermediate regimes where the limiting dynamics is determined by the interplay between these behaviours. In particular, in the high-dimensional limit, the infinite-width dynamics is found to remain close to a low-dimensional subspace spanned by the target principal directions. Our results therefore provide a unifying picture of the limiting SGD dynamics with synthetic data.
Submitted 12 February, 2023;
originally announced February 2023.
-
Bayes-optimal Learning of Deep Random Networks of Extensive-width
Authors:
Hugo Cui,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We propose a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We further compute closed-form expressions for the test errors of ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression, achieve Bayes-optimal performances, while the logistic loss yields a near-optimal test error for classification. We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples.
Submitted 21 June, 2023; v1 submitted 1 February, 2023;
originally announced February 2023.
-
Low-rank Matrix Estimation with Inhomogeneous Noise
Authors:
Alice Guionnet,
Justin Ko,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We study low-rank matrix estimation for a generic inhomogeneous output channel through which the matrix is observed. This generalizes the commonly considered spiked matrix model with homogeneous noise to include, for instance, the dense degree-corrected stochastic block model. We adapt techniques used to study multispecies spin glasses to derive and rigorously prove an expression for the free energy of the problem in the large size limit, providing a framework to study the signal detection thresholds. We discuss an application of this framework to degree-corrected stochastic block models.
Submitted 11 August, 2022;
originally announced August 2022.
-
Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap
Authors:
Luca Pesce,
Bruno Loureiro,
Florent Krzakala,
Lenka Zdeborová
Abstract:
A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $ρ$, as well as the ratio $α$ between the number of samples and the dimension, are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial algorithms for this task. We identify in particular the existence of a statistical-to-computational gap between the algorithm, which requires a signal-to-noise ratio $λ_{\text{alg}} \ge k / \sqrt{α}$ to perform better than random, and the information-theoretic threshold at $λ_{\text{it}} \approx \sqrt{-k ρ \log ρ} / \sqrt{α}$. Finally, we discuss the case of sub-extensive sparsity $ρ$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.
Submitted 1 December, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Gaussian Universality of Perceptrons with Random Labels
Authors:
Federica Gerace,
Florent Krzakala,
Bruno Loureiro,
Ludovic Stephan,
Lenka Zdeborová
Abstract:
While classical in many theoretical settings - and in particular in statistical physics-inspired works - the assumption of Gaussian i.i.d. input data is often perceived as a strong limitation in the context of statistics and machine learning. In this study, we redeem this line of work in the case of generalized linear classification, a.k.a. the perceptron model, with random labels. We argue that there is a large universality class of high-dimensional input data for which we obtain the same minimum training loss as for Gaussian data with corresponding data covariance. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. On the theoretical side, we prove this universality for an arbitrary mixture of homogeneous Gaussian clouds. Empirically, we show that the universality holds also for a broad range of real datasets.
Submitted 2 March, 2023; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Optimal denoising of rotationally invariant rectangular matrices
Authors:
Emanuele Troiani,
Vittorio Erba,
Florent Krzakala,
Antoine Maillard,
Lenka Zdeborová
Abstract:
In this manuscript we consider denoising of large rectangular matrices: given a noisy observation of a signal matrix, what is the best way of recovering the signal matrix itself? For Gaussian noise and rotationally-invariant signal priors, we completely characterize the optimal denoiser and its performance in the high-dimensional limit, in which the size of the signal matrix goes to infinity with fixed aspect ratio, and under the Bayes optimal setting, that is when the statistician knows how the signal and the observations were generated. Our results generalise previous works that considered only symmetric matrices to the more general case of non-symmetric and rectangular ones. We explore analytically and numerically a particular choice of factorized signal prior that models cross-covariance matrices and the matrix factorization problem. As a byproduct of our analysis, we provide an explicit asymptotic evaluation of the rectangular Harish-Chandra-Itzykson-Zuber integral in a special case.
Submitted 15 March, 2022;
originally announced March 2022.
-
Theoretical characterization of uncertainty in high-dimensional linear classification
Authors:
Lucas Clarté,
Bruno Loureiro,
Florent Krzakala,
Lenka Zdeborová
Abstract:
Being able to reliably assess not only the \emph{accuracy} but also the \emph{uncertainty} of models' predictions is an important endeavour in modern machine learning. Even if the model generating the data and labels is known, computing the intrinsic uncertainty after learning the model from a limited number of samples amounts to sampling the corresponding posterior probability measure. Such sampling is computationally challenging in high-dimensional problems and theoretical results on heuristic uncertainty estimators in high dimensions are thus scarce. In this manuscript, we characterise uncertainty for learning from a limited number of samples of high-dimensional Gaussian input data and labels generated by the probit model. In this setting, the Bayesian uncertainty (i.e. the posterior marginals) can be asymptotically obtained by the approximate message passing algorithm, bypassing the canonical but costly Monte Carlo sampling of the posterior. We then provide a closed-form formula for the joint statistics between the logistic classifier, the uncertainty of the statistically optimal Bayesian classifier and the ground-truth probit uncertainty. The formula allows us to investigate calibration of the logistic classifier learning from a limited number of samples. We discuss how over-confidence can be mitigated by appropriately regularising.
Submitted 14 November, 2022; v1 submitted 7 February, 2022;
originally announced February 2022.
-
Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks
Authors:
Rodrigo Veiga,
Ludovic Stephan,
Bruno Loureiro,
Florent Krzakala,
Lenka Zdeborová
Abstract:
Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
Submitted 14 June, 2023; v1 submitted 1 February, 2022;
originally announced February 2022.
-
Fluctuations, Bias, Variance & Ensemble of Learners: Exact Asymptotics for Convex Losses in High-Dimension
Authors:
Bruno Loureiro,
Cédric Gerbelot,
Maria Refinetti,
Gabriele Sicuro,
Florent Krzakala
Abstract:
From the sampling of data to the initialisation of parameters, randomness is ubiquitous in modern Machine Learning practice. Understanding the statistical fluctuations engendered by the different sources of randomness in prediction is therefore key to understanding robust generalisation. In this manuscript we develop a quantitative and rigorous theory for the study of fluctuations in an ensemble of generalised linear models trained on different, but correlated, features in high dimensions. In particular, we provide a complete description of the asymptotic joint distribution of the empirical risk minimiser for generic convex loss and regularisation in the high-dimensional limit. Our result encompasses a rich set of classification and regression tasks, such as the lazy regime of overparametrised neural networks, or equivalently the random features approximation of kernels. While allowing one to study directly the mitigating effect of ensembling (or bagging) on the bias-variance decomposition of the test error, our analysis also helps disentangle the contribution of statistical fluctuations, and the singular role played by the interpolation threshold that are at the roots of the "double-descent" phenomenon.
Submitted 31 January, 2022;
originally announced January 2022.
-
Perturbative construction of mean-field equations in extensive-rank matrix factorization and denoising
Authors:
Antoine Maillard,
Florent Krzakala,
Marc Mézard,
Lenka Zdeborová
Abstract:
Factorization of matrices where the rank of the two factors diverges linearly with their sizes has many applications in diverse areas such as unsupervised representation learning, dictionary learning or sparse coding. We consider a setting where the two factors are generated from known component-wise independent prior distributions, and the statistician observes a (possibly noisy) component-wise function of their matrix product. In the limit where the dimensions of the matrices tend to infinity, but their ratios remain fixed, we expect to be able to derive closed form expressions for the optimal mean squared error on the estimation of the two factors. However, this remains a very involved mathematical and algorithmic problem. A related, but simpler, problem is extensive-rank matrix denoising, where one aims to reconstruct a matrix with extensive but usually small rank from noisy measurements. In this paper, we approach both these problems using high-temperature expansions at fixed order parameters. This allows us to clarify how previous attempts at solving these problems failed at finding an asymptotically exact solution. We provide a systematic way to derive the corrections to these existing approximations, taking into account the structure of correlations particular to the problem. Finally, we illustrate our approach in detail on the case of extensive-rank matrix denoising. We compare our results with known optimal rotationally-invariant estimators, and show how exact asymptotic calculations of the minimal error can be performed using extensive-rank matrix integrals.
Submitted 8 June, 2022; v1 submitted 17 October, 2021;
originally announced October 2021.
-
Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions
Authors:
Bruno Loureiro,
Gabriele Sicuro,
Cédric Gerbelot,
Alessandro Pacco,
Florent Krzakala,
Lenka Zdeborová
Abstract:
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks. In this manuscript, we characterise the learning of a mixture of $K$ Gaussians with generic means and covariances via empirical risk minimisation (ERM) with any convex loss and regularisation. In particular, we prove exact asymptotics characterising the ERM estimator in high-dimensions, extending several previous results about Gaussian mixture classification in the literature. We exemplify our result in two tasks of interest in statistical learning: a) classification for a mixture with sparse means, where we study the efficiency of $\ell_1$ penalty with respect to $\ell_2$; b) max-margin multi-class classification, where we characterise the phase transition on the existence of the multi-class logistic maximum likelihood estimator for $K>2$. Finally, we discuss how our theory can be applied beyond the scope of synthetic data, showing that in different cases Gaussian mixtures capture closely the learning curve of classification tasks in real data sets.
Submitted 14 December, 2021; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Generalization Error Rates in Kernel Regression: The Crossover from the Noiseless to Noisy Regime
Authors:
Hugo Cui,
Bruno Loureiro,
Florent Krzakala,
Lenka Zdeborová
Abstract:
In this manuscript we consider Kernel Ridge Regression (KRR) under the Gaussian design. Exponents for the decay of the excess generalization error of KRR have been reported in various works under the assumption of power-law decay of eigenvalues of the feature covariance. These decays were, however, provided for sizeably different setups, namely in the noiseless case with constant regularization and in the noisy optimally regularized case. Intermediary settings have been left substantially uncharted. In this work, we unify and extend this line of work, providing a characterization of all regimes and excess error decay rates that can be observed in terms of the interplay of noise and regularization. In particular, we show the existence of a transition in the noisy setting from the noiseless exponents to their noisy values as the sample complexity is increased. Finally, we illustrate how this crossover can also be observed on real data sets.
Submitted 15 December, 2021; v1 submitted 31 May, 2021;
originally announced May 2021.
-
Bayesian reconstruction of memories stored in neural networks from their connectivity
Authors:
Sebastian Goldt,
Florent Krzakala,
Lenka Zdeborová,
Nicolas Brunel
Abstract:
The advent of comprehensive synaptic wiring diagrams of large neural circuits has created the field of connectomics and given rise to a number of open research questions. One such question is whether it is possible to reconstruct the information stored in a recurrent network of neurons, given its synaptic connectivity matrix. Here, we address this question by determining when solving such an inference problem is theoretically possible in specific attractor network models and by providing a practical algorithm to do so. The algorithm builds on ideas from statistical physics to perform approximate Bayesian inference and is amenable to exact analysis. We study its performance on three different models, compare the algorithm to standard algorithms such as PCA, and explore the limitations of reconstructing stored patterns from synaptic connectivity.
Submitted 29 August, 2022; v1 submitted 16 May, 2021;
originally announced May 2021.
-
Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed
Authors:
Maria Refinetti,
Sebastian Goldt,
Florent Krzakala,
Lenka Zdeborová
Abstract:
A recent series of theoretical works showed that the dynamics of neural networks with a certain initialisation are well-captured by kernel methods. Concurrent empirical work demonstrated that kernel methods can come close to the performance of neural networks on some image classification tasks. These results raise the question of whether neural networks only learn successfully if kernels also learn successfully, despite neural networks being more expressive. Here, we show theoretically that two-layer neural networks (2LNN) with only a few hidden neurons can beat the performance of kernel learning on a simple Gaussian mixture classification task. We study the high-dimensional limit where the number of samples is linearly proportional to the input dimension, and show that while small 2LNN achieve near-optimal performance on this task, lazy training approaches such as random features and kernel methods do not. Our analysis is based on the derivation of a closed set of equations that track the learning dynamics of the 2LNN and thus allow us to extract the asymptotic performance of the network as a function of signal-to-noise ratio and other hyperparameters. We finally illustrate how over-parametrising the neural network leads to faster convergence, but does not improve its final performance.
Submitted 10 June, 2021; v1 submitted 23 February, 2021;
originally announced February 2021.
-
Learning curves of generic features maps for realistic datasets with a teacher-student model
Authors:
Bruno Loureiro,
Cédric Gerbelot,
Hugo Cui,
Sebastian Goldt,
Florent Krzakala,
Marc Mézard,
Lenka Zdeborová
Abstract:
Teacher-student models provide a framework in which the typical-case performance of high-dimensional supervised learning can be described in closed form. The assumptions of Gaussian i.i.d. input data underlying the canonical teacher-student model may, however, be perceived as too restrictive to capture the behaviour of realistic data sets. In this paper, we introduce a Gaussian covariate generalisation of the model where the teacher and student can act on different spaces, generated with fixed, but generic feature maps. While still solvable in closed form, this generalisation is able to capture the learning curves for a broad range of realistic data sets, thus redeeming the potential of the teacher-student framework. Our contribution is then two-fold: First, we prove a rigorous formula for the asymptotic training loss and generalisation error. Second, we present a number of situations where the learning curve of the model captures that of a realistic data set learned with kernel regression and classification, with out-of-the-box feature maps such as random projections or scattering transforms, or with pre-learned ones - such as the features learned by training multi-layer neural networks. We discuss both the power and the limitations of the framework.
Submitted 14 December, 2021; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Construction of optimal spectral methods in phase retrieval
Authors:
Antoine Maillard,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová
Abstract:
We consider the phase retrieval problem, in which the observer wishes to recover an $n$-dimensional real or complex signal $\mathbf{X}^\star$ from the (possibly noisy) observation of $|\mathbf{\Phi} \mathbf{X}^\star|$, in which $\mathbf{\Phi}$ is a matrix of size $m \times n$. We consider a \emph{high-dimensional} setting where $n,m \to \infty$ with $m/n = \mathcal{O}(1)$, and a large class of (possibly correlated) random matrices $\mathbf{\Phi}$ and observation channels. Spectral methods are a powerful tool to obtain approximate observations of the signal $\mathbf{X}^\star$ which can then be used as initialization for a subsequent algorithm, at a low computational cost. In this paper, we extend and unify previous results and approaches on spectral methods for the phase retrieval problem. More precisely, we combine the linearization of message-passing algorithms and the analysis of the \emph{Bethe Hessian}, a classical tool of statistical physics. Using this toolbox, we show how to derive optimal spectral methods for arbitrary channel noise and right-unitarily invariant matrix $\mathbf{\Phi}$, in an automated manner (i.e. with no optimization over any hyperparameter or preprocessing function).
Submitted 14 October, 2021; v1 submitted 8 December, 2020;
originally announced December 2020.
-
Epidemic mitigation by statistical inference from contact tracing data
Authors:
Antoine Baker,
Indaco Biazzo,
Alfredo Braunstein,
Giovanni Catania,
Luca Dall'Asta,
Alessandro Ingrosso,
Florent Krzakala,
Fabio Mazza,
Marc Mézard,
Anna Paola Muntoni,
Maria Refinetti,
Stefano Sarao Mannelli,
Lenka Zdeborová
Abstract:
Contact-tracing is an essential tool in order to mitigate the impact of pandemics such as COVID-19. In order to achieve efficient and scalable contact-tracing in real time, digital devices can play an important role. While a lot of attention has been paid to analyzing the privacy and ethical risks of the associated mobile applications, so far much less research has been devoted to optimizing their performance and assessing their impact on the mitigation of the epidemic. We develop Bayesian inference methods to estimate the risk that an individual is infected. This inference is based on the list of his recent contacts and their own risk levels, as well as personal information such as results of tests or presence of syndromes. We propose to use probabilistic risk estimation in order to optimize testing and quarantining strategies for the control of an epidemic. Our results show that in some range of epidemic spreading (typically when the manual tracing of all contacts of infected people becomes practically impossible, but before the fraction of infected people reaches the scale where a lock-down becomes unavoidable), this inference of individuals at risk could be an efficient way to mitigate the epidemic. Our approaches translate into fully distributed algorithms that only require communication between individuals who have recently been in contact. Such communication may be encrypted and anonymized and thus compatible with privacy-preserving standards. We conclude that probabilistic risk estimation is capable of enhancing the performance of digital contact tracing and should be considered in the currently developed mobile applications.
Submitted 20 September, 2020;
originally announced September 2020.
-
The Gaussian equivalence of generative models for learning with shallow neural networks
Authors:
Sebastian Goldt,
Bruno Loureiro,
Galen Reeves,
Florent Krzakala,
Marc Mézard,
Lenka Zdeborová
Abstract:
Understanding the impact of data structure on the computational tractability of learning is a key challenge for the theory of neural networks. Many theoretical works do not explicitly model training data, or assume that inputs are drawn component-wise independently from some simple probability distribution. Here, we go beyond this simple paradigm by studying the performance of neural networks trained on data drawn from pre-trained generative models. This is possible due to a Gaussian equivalence stating that the key metrics of interest, such as the training and test errors, can be fully captured by an appropriately chosen Gaussian model. We provide three strands of rigorous, analytical and numerical evidence corroborating this equivalence. First, we establish rigorous conditions for the Gaussian equivalence to hold in the case of single-layer generative models, as well as deterministic rates for convergence in distribution. Second, we leverage this equivalence to derive a closed set of equations describing the generalisation performance of two widely studied machine learning problems: two-layer neural networks trained using one-pass stochastic gradient descent, and full-batch pre-learned features or kernel methods. Finally, we perform experiments demonstrating how our theory applies to deep, pre-trained generative models. These results open a viable path to the theoretical study of machine learning models with realistic data.
Submitted 21 May, 2021; v1 submitted 25 June, 2020;
originally announced June 2020.
-
Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval
Authors:
Stefano Sarao Mannelli,
Giulio Biroli,
Chiara Cammarota,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability to find good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small, the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable, developing a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.
Submitted 12 June, 2020;
originally announced June 2020.
-
Asymptotic Errors for Teacher-Student Convex Generalized Linear Models (or : How to Prove Kabashima's Replica Formula)
Authors:
Cedric Gerbelot,
Alia Abbara,
Florent Krzakala
Abstract:
There has been a recent surge of interest in the study of asymptotic reconstruction performance in various cases of generalized linear estimation problems in the teacher-student setting, especially for the case of i.i.d. standard normal matrices. Here, we go beyond these matrices, and prove an analytical formula for the reconstruction performance of convex generalized linear models with rotationally-invariant data matrices with arbitrary bounded spectrum, rigorously confirming, under suitable assumptions, a conjecture originally derived using the replica method from statistical physics. The proof is achieved by leveraging message passing algorithms and the statistical properties of their iterates, which allows us to characterize the asymptotic empirical distribution of the estimator. For sufficiently strongly convex problems, we show that the two-layer vector approximate message passing algorithm (2-MLVAMP) converges, where the convergence analysis is done by checking the stability of an equivalent dynamical system, which gives the result for such problems. We then show that, under a concentration assumption, an analytical continuation may be carried out to extend the result to convex (non-strongly) problems. We illustrate our claim with numerical examples on mainstream learning methods such as sparse logistic regression and linear support vector classifiers, showing excellent agreement between moderate-size simulations and the asymptotic prediction.
Submitted 10 November, 2022; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization
Authors:
Benjamin Aubin,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová
Abstract:
We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer neural network with random i.i.d. inputs. We study the generalization performance of standard classifiers in the high-dimensional regime where $\alpha=n/d$ is kept finite in the limit of a high dimension $d$ and number of samples $n$. Our contribution is three-fold: First, we prove a formula for the generalization error achieved by $\ell_2$ regularized classifiers that minimize a convex loss. This formula was first obtained by the heuristic replica method of statistical physics. Secondly, focussing on commonly used loss functions and optimizing the $\ell_2$ regularization strength, we observe that while ridge regression performance is poor, logistic and hinge regression are surprisingly able to approach the Bayes-optimal generalization error extremely closely. As $\alpha\to \infty$ they lead to Bayes-optimal rates, a fact that does not follow from predictions of margin-based generalization error bounds. Third, we design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error.
Submitted 7 November, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification
Authors:
Francesca Mignacco,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
We analyze in a closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture where each cluster is assigned one of two labels. This problem provides a prototype of a non-convex loss landscape with interpolating regimes and a large generalization gap. We define a particular stochastic process for which SGD can be extended to a continuous-time limit that we call stochastic gradient flow. In the full-batch limit, we recover the standard gradient flow. We apply dynamical mean-field theory from statistical physics to track the dynamics of the algorithm in the high-dimensional limit via a self-consistent stochastic process. We explore the performance of the algorithm as a function of the control parameters shedding light on how it navigates the loss landscape.
Submitted 9 November, 2021; v1 submitted 10 June, 2020;
originally announced June 2020.
-
Phase retrieval in high dimensions: Statistical and computational phase transitions
Authors:
Antoine Maillard,
Bruno Loureiro,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We consider the phase retrieval problem of reconstructing an $n$-dimensional real or complex signal $\mathbf{X}^{\star}$ from $m$ (possibly noisy) observations $Y_\mu = | \sum_{i=1}^n \Phi_{\mu i} X^{\star}_i/\sqrt{n}|$, for a large class of correlated real and complex random sensing matrices $\mathbf{\Phi}$, in a high-dimensional setting where $m,n\to\infty$ while $\alpha = m/n=\Theta(1)$. First, we derive sharp asymptotics for the lowest possible estimation error achievable statistically and we unveil the existence of sharp phase transitions for the weak- and full-recovery thresholds as a function of the singular values of the matrix $\mathbf{\Phi}$. This is achieved by providing a rigorous proof of a result first obtained by the replica method from statistical mechanics. In particular, the information-theoretic transition to perfect recovery for full-rank matrices appears at $\alpha=1$ (real case) and $\alpha=2$ (complex case). Secondly, we analyze the performance of the best-known polynomial time algorithm for this problem -- approximate message-passing -- establishing the existence of a statistical-to-algorithmic gap depending, again, on the spectral properties of $\mathbf{\Phi}$. Our work provides an extensive classification of the statistical and algorithmic thresholds in high-dimensional phase retrieval for a broad class of random matrices.
Submitted 23 October, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Tree-AMP: Compositional Inference with Tree Approximate Message Passing
Authors:
Antoine Baker,
Benjamin Aubin,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We introduce Tree-AMP, standing for Tree Approximate Message Passing, a Python package for compositional inference in high-dimensional tree-structured models. The package provides a unifying framework to study several approximate message passing algorithms previously derived for a variety of machine learning tasks such as generalized linear models, inference in multi-layer networks, matrix factorization, and reconstruction using non-separable penalties. For some models, the asymptotic performance of the algorithm can be theoretically predicted by the state evolution, and the measurements entropy estimated by the free entropy formalism. The implementation is modular by design: each module, which implements a factor, can be composed at will with other modules to solve complex inference tasks. The user only needs to declare the factor graph of the model: the inference algorithm, state evolution and entropy estimation are fully automated.
Submitted 11 December, 2021; v1 submitted 3 April, 2020;
originally announced April 2020.
-
Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime
Authors:
Stéphane d'Ascoli,
Maria Refinetti,
Giulio Biroli,
Florent Krzakala
Abstract:
Deep neural networks can achieve remarkable generalization performances while interpolating the training data perfectly. Rather than the U-curve emblematic of the bias-variance trade-off, their test error often follows a "double descent" - a mark of the beneficial role of overparametrization. In this work, we develop a quantitative theory for this phenomenon in the so-called lazy learning regime of neural networks, by considering the problem of learning a high-dimensional function with random features regression. We obtain a precise asymptotic expression for the bias-variance decomposition of the test error, and show that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant. We disentangle the variances stemming from the sampling of the dataset, from the additive noise corrupting the labels, and from the initialization of the weights. Following up on Geiger et al. 2019, we first show that the latter two contributions are the crux of the double descent: they lead to the overfitting peak at the interpolation threshold and to the decay of the test error upon overparametrization. We then quantify how they are suppressed by ensemble averaging the outputs of K independently initialized estimators. When K is sent to infinity, the test error remains constant beyond the interpolation threshold. We further compare the effects of overparametrizing, ensembling and regularizing. Finally, we present numerical experiments on classic deep learning setups to show that our results hold qualitatively in realistic lazy learning scenarios.
Submitted 3 April, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
-
The role of regularization in classification of high-dimensional noisy Gaussian mixture
Authors:
Francesca Mignacco,
Florent Krzakala,
Yue M. Lu,
Lenka Zdeborová
Abstract:
We consider a high-dimensional mixture of two Gaussians in the noisy regime where even an oracle knowing the centers of the clusters misclassifies a small but finite fraction of the points. We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number $n$ of samples and their dimension $d$ go to infinity while their ratio is fixed to $\alpha = n/d$. We discuss surprising effects of the regularization, which in some cases allows one to reach Bayes-optimal performance. We also illustrate the interpolation peak at low regularization, and analyze the role of the respective sizes of the two clusters.
Submitted 26 February, 2020;
originally announced February 2020.
-
Asymptotic errors for convex penalized linear regression beyond Gaussian matrices
Authors:
Cédric Gerbelot,
Alia Abbara,
Florent Krzakala
Abstract:
We consider the problem of learning a coefficient vector $x_{0}$ in $\mathbb{R}^{N}$ from noisy linear observations $y=Fx_{0}+w$ in $\mathbb{R}^{M}$ in the high-dimensional limit $M,N \to \infty$ with $\alpha=M/N$ fixed. We provide a rigorous derivation of an explicit formula -- first conjectured using heuristic methods from statistical physics -- for the asymptotic mean squared error obtained by penalized convex regression estimators such as the LASSO or the elastic net, for a class of very generic random matrices corresponding to rotationally invariant data matrices with arbitrary spectrum. The proof is based on a convergence analysis of an oracle version of vector approximate message-passing (oracle-VAMP) and on the properties of its state evolution equations. Our method leverages and highlights the link between vector approximate message-passing, Douglas-Rachford splitting and proximal descent algorithms, extending previous results obtained with i.i.d. matrices for a large class of problems. We illustrate our results on some concrete examples and show that even though they are asymptotic, our predictions agree remarkably well with numerics even for very moderate sizes.
Submitted 11 February, 2020;
originally announced February 2020.
-
Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning
Authors:
Alia Abbara,
Benjamin Aubin,
Florent Krzakala,
Lenka Zdeborová
Abstract:
Statistical learning theory provides bounds on the generalization gap, using in particular the Vapnik-Chervonenkis dimension and the Rademacher complexity. An alternative approach, mainly studied in the statistical physics literature, is the study of generalization in simple synthetic-data models. Here we discuss the connections between these approaches and focus on the link between the Rademacher complexity in statistical learning and the theories of generalization for typical-case synthetic models from statistical physics, involving quantities known as Gardner capacity and ground state energy. We show that in these models the Rademacher complexity is closely related to the ground state energy computed by replica theories. Using this connection, one may reinterpret many results of the literature as rigorous Rademacher bounds in a variety of models in the high-dimensional statistics limit. Somewhat surprisingly, we also show that statistical learning theory provides predictions for the behavior of the ground-state energies in some full replica symmetry breaking models.
Submitted 15 June, 2020; v1 submitted 5 December, 2019;
originally announced December 2019.