Showing 1–7 of 7 results for author: Pennington, J

Searching in archive math.
  1. arXiv:2405.15074  [pdf, other]

    stat.ML cs.LG math.OC math.PR math.ST

    4+3 Phases of Compute-Optimal Neural Scaling Laws

    Authors: Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington

    Abstract: We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves…

    Submitted 18 April, 2025; v1 submitted 23 May, 2024; originally announced May 2024.

  2. arXiv:2404.19261  [pdf, other]

    cs.LG math.OC math.ST physics.data-an

    High dimensional analysis reveals conservative sharpening and a stochastic edge of stability

    Authors: Atish Agarwala, Jeffrey Pennington

    Abstract: Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues of the training loss Hessian have some remarkably robust features across models and datasets in the full batch regime. There is often an early period of progressive sharpening where the large eigenvalues increase, followed by stabilization at a predictable value known as the edge of stability. Previous work…

    Submitted 31 January, 2025; v1 submitted 30 April, 2024; originally announced April 2024.

  3. arXiv:2210.04860  [pdf, other]

    cs.LG cs.AI math.OC

    Second-order regression models exhibit progressive sharpening to the edge of stability

    Authors: Atish Agarwala, Fabian Pedregosa, Jeffrey Pennington

    Abstract: Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant N…

    Submitted 10 October, 2022; originally announced October 2022.

  4. arXiv:2206.07252  [pdf, other]

    stat.ML cs.LG math.OC math.PR math.ST

    Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

    Authors: Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington

    Abstract: Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quad…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: arXiv admin note: text overlap with arXiv:2205.07069

  5. arXiv:2205.07069  [pdf, other]

    math.ST math.OC math.PR stat.ML

    Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

    Authors: Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington

    Abstract: We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalence of SGD -- for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of…

    Submitted 14 May, 2022; originally announced May 2022.

  6. arXiv:2001.05992  [pdf, other]

    cs.LG cs.NE math.OC stat.ML

    Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

    Authors: Wei Hu, Lechao Xiao, Jeffrey Pennington

    Abstract: The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this wo…

    Submitted 16 January, 2020; originally announced January 2020.

    Comments: International Conference on Learning Representations (ICLR) 2020

  7. arXiv:1902.08129  [pdf, other]

    cs.NE cond-mat.dis-nn cs.LG math.DS

    A Mean Field Theory of Batch Normalization

    Authors: Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

    Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initi…

    Submitted 5 March, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

    Comments: To appear in ICLR 2019