Showing 1–5 of 5 results for author: Meterez, A

Searching in archive cs.
  1. arXiv:2506.15535  [pdf, ps, other]

    cs.LG math.OC stat.ML

    A Simplified Analysis of SGD for Linear Regression with Weight Averaging

    Authors: Alexandru Meterez, Depen Morwani, Costin-Andrei Oncescu, Jingfeng Wu, Cengiz Pehlevan, Sham Kakade

    Abstract: Theoretically understanding stochastic gradient descent (SGD) in overparameterized models has led to the development of several optimization algorithms that are widely used in practice today. Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression using constant learning rate, both with and without tail iterate averaging, based on a bias-variance decompo…

    Submitted 18 June, 2025; originally announced June 2025.

  2. arXiv:2504.07912  [pdf, other]

    cs.LG

    Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

    Authors: Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, Eran Malach

    Abstract: Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning models, recent work has demonstrated that RL fine-tuning consistently improves performance, even in smaller-scale models; however, the underlying mechanisms driving these improvements are not well-unders…

    Submitted 10 April, 2025; originally announced April 2025.

    ACM Class: I.2.7

  3. arXiv:2410.04642  [pdf, other]

    cs.LG stat.ML

    The Optimization Landscape of SGD Across the Feature Learning Strength

    Authors: Alexander Atanasov, Alexandru Meterez, James B. Simon, Cengiz Pehlevan

    Abstract: We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a…

    Submitted 2 March, 2025; v1 submitted 6 October, 2024; originally announced October 2024.

    Comments: ICLR 2025 Final Copy, 40 Pages, 45 figures

  4. arXiv:2402.17457  [pdf, other]

    cs.LG

    Super Consistency of Neural Network Landscapes and Learning Rate Transfer

    Authors: Lorenzo Noci, Alexandru Meterez, Thomas Hofmann, Antonio Orvieto

    Abstract: Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters -- such as the learning rate -- exhibit transfer from small to very large models. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is consis…

    Submitted 12 November, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

    Comments: The paper has been accepted at NeurIPS 2024. This is a revised version of the paper previously titled "Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning"

  5. arXiv:2310.02012  [pdf, other]

    cs.LG cs.AI

    Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

    Authors: Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand

    Abstract: Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves signal propagation by preventing the representations from becoming collinear across the layers. However, results on mean-field theory of batch normalization also conclude that this benefit comes at the expense of exploding gradients in depth.…

    Submitted 3 October, 2023; originally announced October 2023.