Showing 1–9 of 9 results for author: Kosson, A

Searching in archive cs.

  1. arXiv:2410.23922  [pdf, other]

    cs.LG

    Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

    Authors: Atli Kosson, Bettina Messmer, Martin Jaggi

    Abstract: Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta\mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta\mathbf{w}_t$ limited,…

    Submitted 31 October, 2024; originally announced October 2024.

    Comments: Accepted to NeurIPS 2024

  2. arXiv:2405.18392  [pdf, other]

    cs.LG

    Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

    Authors: Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi

    Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup as well as future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across…

    Submitted 17 October, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Spotlight at NeurIPS 2024

  3. arXiv:2309.12381  [pdf, other]

    cs.LG

    Memory Efficient Mixed-Precision Optimizers

    Authors: Basile Lewandowski, Atli Kosson

    Abstract: Traditional optimization methods rely on the use of single-precision floating point arithmetic, which can be costly in terms of memory size and computing power. However, mixed precision optimization techniques leverage the use of both single and half-precision floating point arithmetic to reduce memory requirements while maintaining model accuracy. We provide here an algorithm to further reduce me…

    Submitted 21 September, 2023; originally announced September 2023.

  4. arXiv:2305.17212  [pdf, other]

    cs.LG

    Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

    Authors: Atli Kosson, Bettina Messmer, Martin Jaggi

    Abstract: This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular updates of a neuron's weight vector to converge to a steady state we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the…

    Submitted 3 June, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to ICML 2024; Code available at https://github.com/epfml/REQ

  5. arXiv:2305.17205  [pdf, other]

    cs.LG

    Ghost Noise for Regularizing Deep Neural Networks

    Authors: Atli Kosson, Dongyang Fan, Martin Jaggi

    Abstract: Batch Normalization (BN) is widely used to stabilize the optimization process and improve the test performance of deep neural networks. The regularization effect of BN depends on the batch size and explicitly using smaller batch sizes with Batch Normalization, a method known as Ghost Batch Normalization (GBN), has been found to improve generalization in many settings. We investigate the effectiven…

    Submitted 19 December, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Journal ref: AAAI 2024

  6. arXiv:2305.17190  [pdf, other]

    cs.LG

    Multiplication-Free Transformer Training via Piecewise Affine Operations

    Authors: Atli Kosson, Martin Jaggi

    Abstract: Multiplications are responsible for most of the computational cost involved in neural network training and inference. Recent research has thus looked for ways to reduce the cost associated with them. Inspired by Mogami (2020), we replace multiplication with a cheap piecewise affine approximation that is achieved by adding the bit representation of the floating point numbers together as integers. W…

    Submitted 25 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  7. arXiv:2007.01397  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Adaptive Braking for Mitigating Gradient Delay

    Authors: Abhinav Venigalla, Atli Kosson, Vitaliy Chiley, Urs Köster

    Abstract: Neural network training is commonly accelerated by using multiple synchronized workers to compute gradient updates in parallel. Asynchronous methods remove synchronization overheads and improve hardware utilization at the cost of introducing gradient delay, which impedes optimization and can lead to lower final model performance. We introduce Adaptive Braking (AB), a modification for momentum-base…

    Submitted 10 July, 2020; v1 submitted 2 July, 2020; originally announced July 2020.

    Comments: In Beyond First Order Methods in ML Systems workshop at the 37th International Conference on Machine Learning, 2020

  8. arXiv:2003.11666  [pdf, other]

    cs.LG cs.DC stat.ML

    Pipelined Backpropagation at Scale: Training Large Models without Batches

    Authors: Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, Urs Köster

    Abstract: New hardware can substantially increase the speed and efficiency of deep neural network training. To guide the development of future hardware architectures, it is pertinent to explore the hardware and machine learning properties of alternative training algorithms. In this work we evaluate the use of small batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline parallel training alg…

    Submitted 9 April, 2021; v1 submitted 25 March, 2020; originally announced March 2020.

    Comments: Proceedings of the 4th MLSys Conference, 2021

  9. arXiv:1905.05894  [pdf, other]

    cs.LG stat.ML

    Online Normalization for Training Neural Networks

    Authors: Vitaliy Chiley, Ilya Sharapov, Atli Kosson, Urs Köster, Ryan Reece, Sofia Samaniego de la Fuente, Vishal Subbiah, Michael James

    Abstract: Online Normalization is a new technique for normalizing the hidden activations of a neural network. Like Batch Normalization, it normalizes the sample dimension. While Online Normalization does not use batches, it is as accurate as Batch Normalization. We resolve a theoretical limitation of Batch Normalization by introducing an unbiased technique for computing the gradient of normalized activation…

    Submitted 3 December, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

    Comments: Published at the Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. Code: https://github.com/Cerebras/online-normalization