-
Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity
Authors:
Arthur Jacot,
Eugene Golikov,
Clément Hongler,
Franck Gabriel
Abstract:
We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to construct the activation of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization over a convex cone.
This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the size of the training set). We show that this bound is tight by giving an example of a local minimum that requires $N^{2}/4$ hidden neurons. However, we also observe numerically that in more traditional settings far fewer than $N^{2}$ neurons are required to reach the minima.
Submitted 13 October, 2022; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity
Authors:
Arthur Jacot,
François Ged,
Berfin Şimşek,
Clément Hongler,
Franck Gabriel
Abstract:
The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $σ^2$ of the parameters at initialization $θ_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $γ$ of the variance $σ^2=w^{-γ}$ as $w\to\infty$: for large variance ($γ<1$), $θ_0$ is very close to a global minimum but far from any saddle point, and for small variance ($γ>1$), $θ_0$ is close to a saddle point and far from any global minimum. While the first case corresponds to the well-studied NTK regime, the second case is less understood. This motivates the study of the case $γ\to +\infty$, where we conjecture a Saddle-to-Saddle dynamics: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum. We support this conjecture with a theorem for the dynamics between the first two saddles, as well as some numerical experiments.
Submitted 31 January, 2022; v1 submitted 30 June, 2021;
originally announced June 2021.
-
Kernel Alignment Risk Estimator: Risk Prediction from Training Data
Authors:
Arthur Jacot,
Berfin Şimşek,
Francesco Spadaro,
Clément Hongler,
Franck Gabriel
Abstract:
We study the risk (i.e. generalization error) of Kernel Ridge Regression (KRR) for a kernel $K$ with ridge $λ>0$ and i.i.d. observations. For this, we introduce two objects: the Signal Capture Threshold (SCT) and the Kernel Alignment Risk Estimator (KARE). The SCT $\vartheta_{K,λ}$ is a function of the data distribution: it can be used to identify the components of the data that the KRR predictor captures, and to approximate the (expected) KRR risk. This then leads to a KRR risk approximation by the KARE $ρ_{K, λ}$, an explicit function of the training data, agnostic to the true data distribution. We phrase the regression problem in a functional setting. The key results then follow from a finite-size analysis of the Stieltjes transform of general Wishart random matrices. Under a natural universality assumption (that the KRR moments depend asymptotically on the first two moments of the observations) we capture the mean and variance of the KRR predictor. We numerically investigate our findings on the Higgs and MNIST datasets for various classical kernels: the KARE gives an excellent approximation of the risk, thus supporting our universality assumption. Using the KARE, one can compare choices of kernels and hyperparameters directly from the training set. The KARE thus provides a promising data-dependent procedure to select kernels that generalize well.
Submitted 17 June, 2020;
originally announced June 2020.
-
Implicit Regularization of Random Feature Models
Authors:
Arthur Jacot,
Berfin Şimşek,
Francesco Spadaro,
Clément Hongler,
Franck Gabriel
Abstract:
Random Feature (RF) models are used as efficient parametric approximations of kernel methods. We investigate, by means of random matrix theory, the connection between Gaussian RF models and Kernel Ridge Regression (KRR). For a Gaussian RF model with $P$ features, $N$ data points, and a ridge $λ$, we show that the average (i.e. expected) RF predictor is close to a KRR predictor with an effective ridge $\tilde{λ}$. We show that $\tilde{λ} > λ$ and $\tilde{λ} \searrow λ$ monotonically as $P$ grows, thus revealing the implicit regularization effect of finite RF sampling. We then compare the risk (i.e. test error) of the $\tilde{λ}$-KRR predictor with the average risk of the $λ$-RF predictor and obtain a precise and explicit bound on their difference. Finally, we empirically find an extremely good agreement between the test errors of the average $λ$-RF predictor and $\tilde{λ}$-KRR predictor.
Submitted 23 September, 2020; v1 submitted 19 February, 2020;
originally announced February 2020.
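The claim that the effective ridge exceeds $λ$ and decreases monotonically to $λ$ as $P$ grows can be illustrated numerically. The sketch below is not taken from the abstract: it assumes that the effective ridge is the unique positive solution $t$ of the fixed-point equation $t = λ + \frac{t}{γ}\frac{1}{N}\sum_i \frac{d_i}{t + d_i}$ with $γ = P/N$, where the $d_i$ are eigenvalues of the kernel Gram matrix. Both the form of this equation and its normalization should be checked against the paper. The eigenvalues, the function name `effective_ridge`, and the bisection solver are hypothetical choices made for illustration.

```python
def effective_ridge(eigs, ridge, num_features):
    """Solve t = ridge + (t / gamma) * mean(d / (t + d)) for t > 0 by bisection.

    ASSUMPTION: this fixed-point form of the effective ridge is recalled,
    not quoted from the abstract; gamma = P / N.
    """
    n = len(eigs)
    gamma = num_features / n

    def g(t):
        s = sum(d / (t + d) for d in eigs) / n
        return t - ridge - (t / gamma) * s

    # g(ridge) <= 0 and g(t) -> +infinity, so a root lies in [ridge, hi].
    lo, hi = ridge, max(2.0 * ridge, 1.0)
    while g(hi) <= 0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical Gram-matrix eigenvalues (N = 4) and ridge, chosen for illustration.
eigs = [4.0, 1.0, 0.25, 0.05]
ridge = 0.1
feature_counts = [2, 8, 32, 128, 1024]
ridges = [effective_ridge(eigs, ridge, p) for p in feature_counts]

for p, t in zip(feature_counts, ridges):
    print(f"P = {p:5d}  effective ridge = {t:.6f}")
```

Under this assumed equation, the printed values stay above the ridge `0.1` and shrink toward it as `P` increases, matching the qualitative statement in the abstract.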
-
The asymptotic spectrum of the Hessian of DNN throughout training
Authors:
Arthur Jacot,
Franck Gabriel,
Clément Hongler
Abstract:
The dynamics of DNNs during gradient descent is described by the so-called Neural Tangent Kernel (NTK). In this article, we show that the NTK allows one to gain precise insight into the Hessian of the cost of DNNs. When the NTK is fixed during training, we obtain a full characterization of the asymptotics of the spectrum of the Hessian, at initialization and during training. In the so-called mean-field limit, where the NTK is not fixed during training, we describe the first two moments of the Hessian at initialization.
Submitted 10 February, 2020; v1 submitted 1 October, 2019;
originally announced October 2019.
-
Order and Chaos: NTK views on DNN Normalization, Checkerboard and Boundary Artifacts
Authors:
Arthur Jacot,
Franck Gabriel,
François Ged,
Clément Hongler
Abstract:
We analyze architectural features of Deep Neural Networks (DNNs) using the so-called Neural Tangent Kernel (NTK), which describes the training and generalization of DNNs in the infinite-width setting. In this setting, we show that for fully-connected DNNs, as the depth grows, two regimes appear: "order", where the (scaled) NTK converges to a constant, and "chaos", where it converges to a Kronecker delta. Extreme order slows down training while extreme chaos hinders generalization. Using the scaled ReLU as a nonlinearity, we end up in the ordered regime. In contrast, Layer Normalization brings the network into the chaotic regime. We observe a similar effect for Batch Normalization (BN) applied after the last nonlinearity. We uncover the same order and chaos modes in Deep Deconvolutional Networks (DC-NNs). Our analysis explains the appearance of so-called checkerboard patterns and border artifacts. Moving the network into the chaotic regime prevents checkerboard patterns; we propose a graph-based parametrization which eliminates border artifacts; finally, we introduce a new layer-dependent learning rate to improve the convergence of DC-NNs. We illustrate our findings on DCGANs: the ordered regime leads to a collapse of the generator to a checkerboard mode, which can be avoided by tuning the nonlinearity to reach the chaotic regime. As a result, we are able to obtain good quality samples for DCGANs without BN.
Submitted 22 June, 2020; v1 submitted 11 July, 2019;
originally announced July 2019.
-
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Authors:
Arthur Jacot,
Franck Gabriel,
Clément Hongler
Abstract:
At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function $f_θ$ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describing the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function $f_θ$ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally, we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
Submitted 10 February, 2020; v1 submitted 20 June, 2018;
originally announced June 2018.
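The statement that, for least-squares regression with a constant kernel, training follows a linear differential equation and converges fastest along the largest kernel principal components can be checked on a toy example. The sketch below makes several assumptions: the dynamics on the training points are written as $\frac{d}{dt} f = -K (f - y)$ with normalization constants absorbed into time, and the $2 \times 2$ Gram matrix `K`, the targets `y`, and the helper names are hypothetical. It is not the NTK of any actual network. The residual along an eigenvector with eigenvalue $μ$ should then decay like $e^{-μ t}$.

```python
import math

# Hypothetical constant kernel Gram matrix on two training points (illustration only).
# Eigenvalue 3 along (1, 1)/sqrt(2), eigenvalue 1 along (1, -1)/sqrt(2).
K = [[2.0, 1.0], [1.0, 2.0]]
V_TOP = (1 / math.sqrt(2), 1 / math.sqrt(2))
V_LOW = (1 / math.sqrt(2), -1 / math.sqrt(2))


def kernel_gradient_flow(K, f0, y, t_end, steps):
    """Explicit Euler integration of df/dt = -K (f - y) on the training points."""
    f = list(f0)
    dt = t_end / steps
    for _ in range(steps):
        residual = [fi - yi for fi, yi in zip(f, y)]
        k_res = [sum(K[i][j] * residual[j] for j in range(len(residual)))
                 for i in range(len(residual))]
        f = [fi - dt * g for fi, g in zip(f, k_res)]
    return f


def component(v, r):
    """Coordinate of the vector r along the unit vector v."""
    return sum(vi * ri for vi, ri in zip(v, r))


y = [1.0, -0.5]   # hypothetical training targets
f0 = [0.0, 0.0]   # network outputs at initialization (assumed zero here)
t_end = 1.0
f_t = kernel_gradient_flow(K, f0, y, t_end, steps=10_000)

r0 = [a - b for a, b in zip(f0, y)]
rt = [a - b for a, b in zip(f_t, y)]
ratio_top = component(V_TOP, rt) / component(V_TOP, r0)
ratio_low = component(V_LOW, rt) / component(V_LOW, r0)

print("top component :", ratio_top, "vs exp(-3t) =", math.exp(-3 * t_end))
print("low component :", ratio_low, "vs exp(-1t) =", math.exp(-1 * t_end))
```

In this toy setting the residual along the top eigenvector shrinks much faster than along the low one. That is the mechanism behind the early-stopping remark: stopping early fits the leading components while leaving the slower ones mostly unfit.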