Showing 1–29 of 29 results for author: Wyart, M

  1. arXiv:2505.18651  [pdf, ps, other]

    cs.CL cond-mat.dis-nn cs.LG

    On the Emergence of Linear Analogies in Word Embeddings

    Authors: Daniel J. Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart

    Abstract: Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: Main: 12 pages, 3 figures. Appendices: 8 pages, 7 figures

  2. arXiv:2505.16959  [pdf, other]

    cs.LG stat.ML

    Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models

    Authors: Alessandro Favero, Antonio Sclocchi, Matthieu Wyart

    Abstract: Diffusion probabilistic models have become a cornerstone of modern generative AI, yet the mechanisms underlying their generalization remain poorly understood. In fact, if these models were perfectly minimizing their training loss, they would just generate data belonging to their training set, i.e., memorize, as empirically found in the overparameterized regime. We revisit this view by showing that…

    Submitted 22 May, 2025; originally announced May 2025.

  3. arXiv:2505.07070  [pdf, ps, other]

    cs.LG cond-mat.dis-nn stat.ML

    Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures

    Authors: Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, Matthieu Wyart

    Abstract: How do neural language models acquire a language's structure when trained for next-token prediction? We address this question by deriving theoretical scaling laws for neural network performance on synthetic datasets generated by the Random Hierarchy Model (RHM) -- an ensemble of probabilistic context-free grammars designed to capture the hierarchical structure of natural language while remaining a…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: 14 pages, 8 figures

  4. arXiv:2505.07067  [pdf, other]

    stat.ML cond-mat.dis-nn cs.LG

    Learning curves theory for hierarchically compositional data with power-law distributed features

    Authors: Francesco Cagnetta, Hyunmo Kang, Matthieu Wyart

    Abstract: Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into power-law distributed units. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free gramma…

    Submitted 11 May, 2025; originally announced May 2025.

  5. arXiv:2502.12089  [pdf, ps, other]

    stat.ML cs.LG

    How Compositional Generalization and Creativity Improve as Diffusion Models are Trained

    Authors: Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, Matthieu Wyart

    Abstract: Natural data is often organized as a hierarchical composition of features. How many samples do generative models need in order to learn the composition rules, so as to produce a combinatorially large number of novel data? What signal in the data is exploited to learn those rules? We investigate these questions in the context of diffusion models both theoretically and empirically. Theoretically, we…

    Submitted 4 June, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Journal ref: Proceedings of the 42nd International Conference on Machine Learning (ICML), PMLR 267, 2025

  6. arXiv:2410.13770  [pdf, other]

    stat.ML cond-mat.dis-nn cs.LG

    Probing the Latent Hierarchical Structure of Data via Diffusion Models

    Authors: Antonio Sclocchi, Alessandro Favero, Noam Itzhak Levi, Matthieu Wyart

    Abstract: High-dimensional data must be highly structured to be learnable. Although the compositional and hierarchical nature of data is often put forward to explain learnability, quantitative measurements establishing these properties are scarce. Likewise, accessing the latent variables underlying such a data structure remains a challenge. In this work, we show that forward-backward experiments in diffusio…

    Submitted 28 February, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: 10 pages, 6 figures

  7. arXiv:2408.11841  [pdf, other]

    cs.CY cs.AI cs.CL

    Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants

    Authors: Beatriz Borges, Negar Foroutan, Deniz Bayazit, Anna Sotnikova, Syrielle Montariol, Tanya Nazaretzky, Mohammadreza Banaei, Alireza Sakhaeirad, Philippe Servant, Seyed Parsa Neshaei, Jibril Frej, Angelika Romanou, Gail Weiss, Sepideh Mamooler, Zeming Chen, Simin Fan, Silin Gao, Mete Ismayilzada, Debjit Paul, Alexandre Schöpfer, Andrej Janchevski, Anja Tiede, Clarence Linden, Emanuele Troiani, Francesco Salvi, et al. (65 additional authors not shown)

    Abstract: AI assistants are being increasingly used by students enrolled in higher education institutions. While these tools provide opportunities for improved teaching and education, they also pose significant challenges for assessment and learning outcomes. We conceptualize these challenges through the lens of vulnerability, the potential for university assessments and learning outcomes to be impacted by…

    Submitted 27 November, 2024; v1 submitted 7 August, 2024; originally announced August 2024.

    Comments: 20 pages, 8 figures

    Journal ref: PNAS (2024) Vol. 121 | No. 49

  8. arXiv:2406.00048  [pdf, other]

    cs.CL cond-mat.dis-nn cs.LG

    Towards a theory of how the structure of language is acquired by deep neural networks

    Authors: Francesco Cagnetta, Matthieu Wyart

    Abstract: How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG) -- a tree-like generative model that captures many of the hierarchical structures found in natural languages. We determine token-token correlations analytically in our model and show that they can be used t…

    Submitted 29 October, 2024; v1 submitted 28 May, 2024; originally announced June 2024.

    Comments: NeurIPS 2024

  9. arXiv:2404.10727  [pdf, other]

    stat.ML cond-mat.dis-nn cs.LG

    How Deep Networks Learn Sparse and Hierarchical Data: the Sparse Random Hierarchy Model

    Authors: Umberto Tomasini, Matthieu Wyart

    Abstract: Understanding what makes high-dimensional data learnable is a fundamental question in machine learning. On the one hand, it is believed that the success of deep learning lies in its ability to build a hierarchy of representations that become increasingly more abstract with depth, going from simple features like edges to more complex concepts. On the other hand, learning to be insensitive to invari…

    Submitted 2 May, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: 9 pages, 6 figures

  10. arXiv:2402.16991  [pdf, other]

    stat.ML cond-mat.dis-nn cs.CV cs.LG

    A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data

    Authors: Antonio Sclocchi, Alessandro Favero, Matthieu Wyart

    Abstract: Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organized in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underl…

    Submitted 23 December, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: 9 pages, 7 figures. Appendix: 11 pages, 9 figures

  11. arXiv:2309.10688  [pdf, other]

    cs.LG cond-mat.dis-nn stat.ML

    On the different regimes of Stochastic Gradient Descent

    Authors: Antonio Sclocchi, Matthieu Wyart

    Abstract: Modern deep networks are trained with stochastic gradient descent (SGD) whose key hyperparameters are the number of data considered at each step or batch size $B$, and the step size or learning rate $η$. For small $B$ and large $η$, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the ''temperature'' $T\equiv η/B$. Yet this description is observed t…

    Submitted 27 February, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Main: 8 pages, 4 figures; Appendix: 15 pages, 11 figures

    Journal ref: Proceedings of the National Academy of Sciences 121.9 (2024): e2316301121

  12. arXiv:2307.02129  [pdf, other]

    cs.LG cs.CV stat.ML

    How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model

    Authors: Francesco Cagnetta, Leonardo Petrini, Umberto M. Tomasini, Alessandro Favero, Matthieu Wyart

    Abstract: Deep learning algorithms demonstrate a surprising ability to learn high-dimensional tasks from limited examples. This is commonly attributed to the depth of neural networks, enabling them to build a hierarchy of abstract, low-dimensional data representations. However, how many training examples are required to learn such representations remains unknown. To quantitatively study this question, we in…

    Submitted 3 July, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

    Comments: 9 pages, 8 figures

    Journal ref: Phys. Rev. X 14, 031001 (2024)

  13. arXiv:2301.13703  [pdf, other]

    cs.LG cond-mat.dis-nn

    Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning

    Authors: Antonio Sclocchi, Mario Geiger, Matthieu Wyart

    Abstract: Understanding when the noise in stochastic gradient descent (SGD) affects generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise $T$ affects performance as the size of the training set $P$ and the scale of initialization $α$ are varied. For gradient descent, $α$ is a k…

    Submitted 30 May, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

    Comments: 25 pages, 21 figures, added analysis in feature-learning

  14. arXiv:2210.01506  [pdf, other]

    cs.LG cs.CV

    How deep convolutional neural networks lose spatial information with training

    Authors: Umberto M. Tomasini, Leonardo Petrini, Francesco Cagnetta, Matthieu Wyart

    Abstract: A central question of machine learning is how deep nets manage to learn tasks in high dimensions. An appealing hypothesis is that they achieve this feat by building a representation of the data where information irrelevant to the task is lost. For image datasets, this view is supported by the observation that after (and not before) training, the neural representation becomes less and less sensitiv…

    Submitted 23 November, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

  15. arXiv:2208.01003  [pdf, other]

    stat.ML cs.LG

    What Can Be Learnt With Wide Convolutional Neural Networks?

    Authors: Francesco Cagnetta, Alessandro Favero, Matthieu Wyart

    Abstract: Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g., the rate of decay of the generalisation error with the nu…

    Submitted 31 May, 2023; v1 submitted 1 August, 2022; originally announced August 2022.

    Journal ref: Proceedings of the 40th International Conference on Machine Learning, PMLR 202. 2023

  16. arXiv:2206.12314  [pdf, other]

    stat.ML cs.LG

    Learning sparse features can lead to overfitting in neural networks

    Authors: Leonardo Petrini, Francesco Cagnetta, Eric Vanden-Eijnden, Matthieu Wyart

    Abstract: It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same t…

    Submitted 12 October, 2022; v1 submitted 24 June, 2022; originally announced June 2022.

  17. arXiv:2202.03348  [pdf, other]

    cs.LG cond-mat.stat-mech

    Failure and success of the spectral bias prediction for Kernel Ridge Regression: the case of low-dimensional data

    Authors: Umberto M. Tomasini, Antonio Sclocchi, Matthieu Wyart

    Abstract: Recently, several theories including the replica method made predictions for the generalization error of Kernel Ridge Regression. In some regimes, they predict that the method has a `spectral bias': decomposing the true function $f^*$ on the eigenbasis of the kernel, it fits well the coefficients associated with the O(P) largest eigenvalues, where $P$ is the size of the training set. This predicti…

    Submitted 16 February, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

    Comments: 34 pages, 11 figures

  18. arXiv:2106.08849  [pdf, other]

    cs.LG

    How memory architecture affects learning in a simple POMDP: the two-hypothesis testing problem

    Authors: Mario Geiger, Christophe Eloy, Matthieu Wyart

    Abstract: Reinforcement learning is generally difficult for partially observable Markov decision processes (POMDPs), which occurs when the agent's observation is partial or noisy. To seek good performance in POMDPs, one strategy is to endow the agent with a finite memory, whose update is governed by the policy. However, policy optimization is non-convex in that case and can lead to poor training performance…

    Submitted 18 November, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

  19. arXiv:2106.08619  [pdf, other]

    stat.ML cond-mat.dis-nn cs.LG

    Locality defeats the curse of dimensionality in convolutional teacher-student scenarios

    Authors: Alessandro Favero, Francesco Cagnetta, Matthieu Wyart

    Abstract: Convolutional neural networks perform a local and translationally-invariant treatment of the data: quantifying which of these two aspects is central to their success remains a challenge. We study this problem within a teacher-student framework for kernel regression, using `convolutional' kernels inspired by the neural tangent kernel of simple convolutional architectures of given filter size. Using…

    Submitted 12 November, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: 32 pages, 7 figures

  20. Relative stability toward diffeomorphisms indicates performance in deep nets

    Authors: Leonardo Petrini, Alessandro Favero, Mario Geiger, Matthieu Wyart

    Abstract: Understanding why deep nets can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements support that it is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, that allows to study typical diffeomorphisms of a given norm. We confirm t…

    Submitted 4 November, 2021; v1 submitted 6 May, 2021; originally announced May 2021.

    Comments: NeurIPS 2021 Conference

  21. arXiv:2012.15110  [pdf, other]

    cs.LG

    Perspective: A Phase Diagram for Deep Learning unifying Jamming, Feature Learning and Lazy Training

    Authors: Mario Geiger, Leonardo Petrini, Matthieu Wyart

    Abstract: Deep learning algorithms are responsible for a technological revolution in a variety of tasks including image recognition or Go playing. Yet, why they work is not understood. Ultimately, they manage to classify data lying in high dimension -- a feat generically impossible due to the geometry of high dimensional space and the associated curse of dimensionality. Understanding what kind of structure,…

    Submitted 30 December, 2020; originally announced December 2020.

  22. Geometric compression of invariant manifolds in neural nets

    Authors: Jonas Paccolat, Leonardo Petrini, Mario Geiger, Kevin Tyloo, Matthieu Wyart

    Abstract: We study how neural networks compress uninformative input space in models where data lie in $d$ dimensions, but whose label only vary within a linear manifold of dimension $d_\parallel < d$. We show that for a one-hidden layer network initialized with infinitesimal weights (i.e. in the feature learning regime) trained with gradient descent, the first layer of weights evolve to become nearly insens…

    Submitted 11 March, 2021; v1 submitted 22 July, 2020; originally announced July 2020.

    Journal ref: Journal of Statistical Mechanics: Theory and Experiment, Volume 2021, April 2021

  23. arXiv:2006.09754  [pdf, other]

    cs.LG cond-mat.dis-nn stat.ML

    How isotropic kernels perform on simple invariants

    Authors: Jonas Paccolat, Stefano Spigler, Matthieu Wyart

    Abstract: We investigate how the training curve of isotropic kernel methods depends on the symmetry of the task to be learned, in several settings. (i) We consider a regression task, where the target function is a Gaussian random field that depends only on $d_\parallel$ variables, fewer than the input dimension $d$. We compute the expected test error $ε$ that follows $ε\sim p^{-β}$ where $p$ is the size of…

    Submitted 14 December, 2020; v1 submitted 17 June, 2020; originally announced June 2020.

  24. Disentangling feature and lazy training in deep neural networks

    Authors: Mario Geiger, Stefano Spigler, Arthur Jacot, Matthieu Wyart

    Abstract: Two distinct limits for deep learning have been derived as the network width $h\rightarrow \infty$, depending on how the weights of the last layer scale with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $Θ$. By contrast, in the Mean-Field limit, the dynamics can be expressed in terms of the distribution of the paramet…

    Submitted 4 October, 2020; v1 submitted 19 June, 2019; originally announced June 2019.

    Comments: minor revisions

  25. arXiv:1905.10843  [pdf, other]

    stat.ML cond-mat.dis-nn cs.LG

    Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm

    Authors: Stefano Spigler, Mario Geiger, Matthieu Wyart

    Abstract: How many training data are needed to learn a supervised task? It is often observed that the generalization error decreases as $n^{-β}$ where $n$ is the number of training examples and $β$ an exponent that depends on both data and algorithm. In this work we measure $β$ when applying kernel methods to real datasets. For MNIST we find $β\approx 0.4$ and for CIFAR10 $β\approx 0.1$, for both regression…

    Submitted 18 August, 2020; v1 submitted 26 May, 2019; originally announced May 2019.

    Comments: We added (i) the prediction of the exponent $β$ for real data using kernel PCA; (ii) the generalization of our results to non-Gaussian data from reference [11] (Bordelon et al., "Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks")

  26. arXiv:1901.01608  [pdf, other]

    cond-mat.dis-nn cs.LG

    Scaling description of generalization with number of parameters in deep learning

    Authors: Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, Matthieu Wyart

    Abstract: Supervised deep learning involves the training of neural networks with a large number $N$ of parameters. For large enough $N$, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as $N$ grows past a certain threshold $N^{*}$. Instead, empirical studies have shown that in the over…

    Submitted 8 October, 2019; v1 submitted 6 January, 2019; originally announced January 2019.

    Comments: The clarity of the text has been improved: the section "Related works" has been updated and the section "3.1 Regression task" has been added

  27. arXiv:1810.09665  [pdf, other]

    cs.LG cond-mat.dis-nn stat.ML

    A jamming transition from under- to over-parametrization affects loss landscape and generalization

    Authors: Stefano Spigler, Mario Geiger, Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, Matthieu Wyart

    Abstract: We argue that in fully-connected networks a phase transition delimits the over- and under-parametrized regimes where fitting can or cannot be achieved. Under some general conditions, we show that this transition is sharp for the hinge loss. In the whole over-parametrized regime, poor minima of the loss are not encountered during training since the number of constraints to satisfy is too small to h…

    Submitted 18 June, 2019; v1 submitted 22 October, 2018; originally announced October 2018.

    Comments: arXiv admin note: text overlap with arXiv:1809.09349

  28. arXiv:1809.09349  [pdf, other]

    cond-mat.dis-nn cs.LG

    The jamming transition as a paradigm to understand the loss landscape of deep neural networks

    Authors: Mario Geiger, Stefano Spigler, Stéphane d'Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli, Matthieu Wyart

    Abstract: Deep learning has been immensely successful at a variety of tasks, ranging from classification to AI. Learning corresponds to fitting training data, which is implemented by descending a very high-dimensional loss function. Understanding under which conditions neural networks do not get stuck in poor minima of the loss, and how the landscape of that loss evolves as depth is increased remains a chal…

    Submitted 17 June, 2019; v1 submitted 25 September, 2018; originally announced September 2018.

    Journal ref: Phys. Rev. E 100, 012115 (2019)

  29. arXiv:1803.06969  [pdf, other]

    stat.ML cond-mat.dis-nn cs.LG

    Comparing Dynamics: Deep Neural Networks versus Glassy Systems

    Authors: M. Baity-Jesi, L. Sagun, M. Geiger, S. Spigler, G. Ben Arous, C. Cammarota, Y. LeCun, M. Wyart, G. Biroli

    Abstract: We analyze numerically the training dynamics of deep neural networks (DNN) by using methods developed in statistical physics of glassy systems. The two main issues we address are (1) the complexity of the loss landscape and of the dynamics within it, and (2) to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that dur…

    Submitted 7 June, 2018; v1 submitted 19 March, 2018; originally announced March 2018.

    Comments: 10 pages, 5 figures. Version accepted at ICML 2018

    Journal ref: PMLR 80:324-333, 2018; Republication with DOI (cite this one): J. Stat. Mech. (2019) 124013