Showing 1–20 of 20 results for author: Jacot, A

  1. arXiv:2505.21722  [pdf, other]

    cs.LG cs.AI stat.ML

    Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape

    Authors: Ioannis Bantzis, James B. Simon, Arthur Jacot

    Abstract: When a deep ReLU network is initialized with small weights, GD is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions, which play a similar role as the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight…

    Submitted 27 May, 2025; originally announced May 2025.

  2. arXiv:2410.11275  [pdf, ps, other]

    cs.LG stat.ML

    Shallow diffusion networks provably learn hidden low-dimensional structure

    Authors: Nicholas M. Boffi, Arthur Jacot, Stephen Tu, Ingvar Ziemann

    Abstract: Diffusion-based generative models provide a powerful framework for learning to sample from a complex target distribution. The remarkable empirical success of these models applied to high-dimensional signals, including images and video, stands in stark contrast to classical results highlighting the curse of dimensionality for distribution recovery. In this work, we take a step towards understanding…

    Submitted 15 October, 2024; originally announced October 2024.

  3. arXiv:2410.04887  [pdf, other]

    cs.LG math.OC stat.ML

    Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse

    Authors: Arthur Jacot, Peter Súkeník, Zihan Wang, Marco Mondelli

    Abstract: Deep neural networks (DNNs) at convergence consistently represent the training data in the last layer via a highly symmetric geometric structure referred to as neural collapse. This empirical evidence has spurred a line of theoretical research aimed at proving the emergence of neural collapse, mostly focusing on the unconstrained features model. Here, the features of the penultimate layer are free…

    Submitted 7 October, 2024; originally announced October 2024.

    Comments: 29 pages, 5 figures

  4. arXiv:2407.05664  [pdf, other]

    stat.ML cs.AI cs.LG

    How DNNs break the Curse of Dimensionality: Compositionality and Symmetry Learning

    Authors: Arthur Jacot, Seok Hoan Choi, Yuxiao Wen

    Abstract: We show that deep neural networks (DNNs) can efficiently learn any composition of functions with bounded $F_{1}$-norm, which allows DNNs to break the curse of dimensionality in ways that shallow networks cannot. More specifically, we derive a generalization bound that combines a covering number argument for compositionality, and the $F_{1}$-norm (or the related Barron norm) for large width adaptiv…

    Submitted 6 March, 2025; v1 submitted 8 July, 2024; originally announced July 2024.

  5. arXiv:2405.17580  [pdf, other]

    cs.LG cs.AI stat.ML

    Mixed Dynamics In Linear Networks: Unifying the Lazy and Active Regimes

    Authors: Zhenfeng Tu, Santiago Aranguri, Arthur Jacot

    Abstract: The training dynamics of linear networks are well studied in two distinct setups: the lazy regime and balanced/active regime, depending on the initialization and width of the network. We provide a surprisingly simple unifying formula for the evolution of the learned matrix that contains as special cases both lazy and balanced regimes but also a mixed regime in between the two. In the mixed regime,…

    Submitted 29 October, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  6. arXiv:2405.17573  [pdf, other]

    stat.ML cs.AI cs.LG

    Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets

    Authors: Arthur Jacot, Alexandre Kaiser

    Abstract: We study Leaky ResNets, which interpolate between ResNets and Fully-Connected nets depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamilton…

    Submitted 6 March, 2025; v1 submitted 27 May, 2024; originally announced May 2024.

  7. arXiv:2402.08010  [pdf, other]

    cs.LG cs.AI stat.ML

    Which Frequencies do CNNs Need? Emergent Bottleneck Structure in Feature Learning

    Authors: Yuxiao Wen, Arthur Jacot

    Abstract: We describe the emergence of a Convolution Bottleneck (CBN) structure in CNNs, where the network uses its first few layers to transform the input representation into a representation that is supported only along a few frequencies and channels, before using the last few layers to map back to the outputs. We define the CBN rank, which describes the number and type of frequencies that are kept inside…

    Submitted 6 March, 2025; v1 submitted 12 February, 2024; originally announced February 2024.

  8. arXiv:2305.19008  [pdf, other]

    cs.LG cs.AI stat.ML

    Bottleneck Structure in Learned Features: Low-Dimension vs Regularity Tradeoff

    Authors: Arthur Jacot

    Abstract: Previous work has shown that DNNs with large depth $L$ and $L_{2}$-regularization are biased towards learning low-dimensional representations of the inputs, which can be interpreted as minimizing a notion of rank $R^{(0)}(f)$ of the learned function $f$, conjectured to be the Bottleneck rank. We compute finite depth corrections to this result, revealing a measure $R^{(1)}$ of regularity which boun…

    Submitted 14 August, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

  9. arXiv:2305.16038  [pdf, other]

    cs.LG cs.AI stat.ML

    Implicit bias of SGD in $L_{2}$-regularized linear DNNs: One-way jumps from high to low rank

    Authors: Zihan Wang, Arthur Jacot

    Abstract: The $L_{2}$-regularized loss of Deep Linear Networks (DLNs) with more than one hidden layer has multiple local minima, corresponding to matrices with different ranks. In tasks such as matrix completion, the goal is to converge to the local minimum with the smallest rank that still fits the training data. While rank-underestimating minima can be avoided since they do not fit the data, GD might get…

    Submitted 29 September, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

  10. arXiv:2209.15055  [pdf, other]

    stat.ML cs.AI cs.LG

    Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions

    Authors: Arthur Jacot

    Abstract: We show that the representation cost of fully connected neural networks with homogeneous nonlinearities - which describes the implicit bias in function space of networks with $L_2$-regularization or with losses such as the cross-entropy - converges as the depth of the network goes to infinity to a notion of rank over nonlinear functions. We then inquire under which conditions the global minima of…

    Submitted 23 March, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

  11. arXiv:2205.15809  [pdf, other]

    stat.ML cs.AI cs.LG cs.NE

    Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity

    Authors: Arthur Jacot, Eugene Golikov, Clément Hongler, Franck Gabriel

    Abstract: We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and interpolates between the…

    Submitted 13 October, 2022; v1 submitted 31 May, 2022; originally announced May 2022.

  12. arXiv:2111.03972  [pdf, other]

    cs.LG stat.ML

    Understanding Layer-wise Contributions in Deep Neural Networks through Spectral Analysis

    Authors: Yatin Dandi, Arthur Jacot

    Abstract: Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze…

    Submitted 7 January, 2022; v1 submitted 6 November, 2021; originally announced November 2021.

  13. arXiv:2106.15933  [pdf, other]

    stat.ML cs.LG

    Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity

    Authors: Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, Franck Gabriel

    Abstract: The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $σ^2$ of the parameters at initialization $θ_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $γ$ of the variance $σ^2=w^{-γ}$ as $w\to\infty$: for large variance ($γ<1$), $θ_0$ is very close to a global minimum but far from any saddle point, and for small variance ($γ>1$), $θ_0$ is close t…

    Submitted 31 January, 2022; v1 submitted 30 June, 2021; originally announced June 2021.

  14. arXiv:2106.05710  [pdf, other]

    stat.ML cs.LG

    DNN-Based Topology Optimisation: Spatial Invariance and Neural Tangent Kernel

    Authors: Benjamin Dupuis, Arthur Jacot

    Abstract: We study the Solid Isotropic Material Penalisation (SIMP) method with a density field generated by a fully-connected neural network, taking the coordinates as inputs. In the large width limit, we show that the use of DNNs leads to a filtering effect similar to traditional filtering techniques for SIMP, with a filter described by the Neural Tangent Kernel (NTK). This filter is however not invariant…

    Submitted 28 November, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

    Journal ref: Advances in Neural Information Processing Systems, 34:27659-27669, 2021

  15. arXiv:2006.09796  [pdf, other]

    stat.ML cs.LG math.PR

    Kernel Alignment Risk Estimator: Risk Prediction from Training Data

    Authors: Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, Franck Gabriel

    Abstract: We study the risk (i.e. generalization error) of Kernel Ridge Regression (KRR) for a kernel $K$ with ridge $λ>0$ and i.i.d. observations. For this, we introduce two objects: the Signal Capture Threshold (SCT) and the Kernel Alignment Risk Estimator (KARE). The SCT $\vartheta_{K,λ}$ is a function of the data distribution: it can be used to identify the components of the data that the KRR predictor…

    Submitted 17 June, 2020; originally announced June 2020.

  16. arXiv:2002.08404  [pdf, other]

    stat.ML cs.LG

    Implicit Regularization of Random Feature Models

    Authors: Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, Franck Gabriel

    Abstract: Random Feature (RF) models are used as efficient parametric approximations of kernel methods. We investigate, by means of random matrix theory, the connection between Gaussian RF models and Kernel Ridge Regression (KRR). For a Gaussian RF model with $P$ features, $N$ data points, and a ridge $λ$, we show that the average (i.e. expected) RF predictor is close to a KRR predictor with an effective ri…

    Submitted 23 September, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

    Journal ref: Proceedings of the International Conference on Machine Learning, 2020, pp. 7397-7406

  17. arXiv:1910.02875  [pdf, other]

    cs.LG cs.NE stat.ML

    The asymptotic spectrum of the Hessian of DNN throughout training

    Authors: Arthur Jacot, Franck Gabriel, Clément Hongler

    Abstract: The dynamics of DNNs during gradient descent is described by the so-called Neural Tangent Kernel (NTK). In this article, we show that the NTK allows one to gain precise insight into the Hessian of the cost of DNNs. When the NTK is fixed during training, we obtain a full characterization of the asymptotics of the spectrum of the Hessian, at initialization and during training. In the so-called mean-…

    Submitted 10 February, 2020; v1 submitted 1 October, 2019; originally announced October 2019.

  18. arXiv:1907.05715  [pdf, other]

    cs.LG stat.ML

    Order and Chaos: NTK views on DNN Normalization, Checkerboard and Boundary Artifacts

    Authors: Arthur Jacot, Franck Gabriel, François Ged, Clément Hongler

    Abstract: We analyze architectural features of Deep Neural Networks (DNNs) using the so-called Neural Tangent Kernel (NTK), which describes the training and generalization of DNNs in the infinite-width setting. In this setting, we show that for fully-connected DNNs, as the depth grows, two regimes appear: "order", where the (scaled) NTK converges to a constant, and "chaos", where it converges to a Kronecker…

    Submitted 22 June, 2020; v1 submitted 11 July, 2019; originally announced July 2019.

  19. Disentangling feature and lazy training in deep neural networks

    Authors: Mario Geiger, Stefano Spigler, Arthur Jacot, Matthieu Wyart

    Abstract: Two distinct limits for deep learning have been derived as the network width $h\rightarrow \infty$, depending on how the weights of the last layer scale with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $Θ$. By contrast, in the Mean-Field limit, the dynamics can be expressed in terms of the distribution of the paramet…

    Submitted 4 October, 2020; v1 submitted 19 June, 2019; originally announced June 2019.

    Comments: minor revisions

  20. arXiv:1806.07572  [pdf, other]

    cs.LG cs.NE math.PR stat.ML

    Neural Tangent Kernel: Convergence and Generalization in Neural Networks

    Authors: Arthur Jacot, Franck Gabriel, Clément Hongler

    Abstract: At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function $f_θ$ (which maps input vectors to output vectors) follows the kernel gradient…

    Submitted 10 February, 2020; v1 submitted 20 June, 2018; originally announced June 2018.

    Journal ref: In Advances in neural information processing systems (pp. 8571-8580) 2018