Showing 1–14 of 14 results for author: Therien, B

  1. arXiv:2506.10315

    cs.LG

    PyLO: Towards Accessible Learned Optimizers in PyTorch

    Authors: Paul Janson, Benjamin Therien, Quentin Anthony, Xiaolong Huang, Abhinav Moudgil, Eugene Belilovsky

    Abstract: Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX a…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted at ICML CODEML Workshop 2025

  2. arXiv:2505.23725

    cs.LG

    MuLoCo: Muon is a practical inner optimizer for DiLoCo

    Authors: Benjamin Thérien, Xiaolong Huang, Irina Rish, Eugene Belilovsky

    Abstract: DiLoCo is a powerful framework for training large language models (LLMs) under networking constraints with advantages for increasing parallelism and accelerator utilization in data center settings. Despite significantly reducing communication frequency, however, DiLoCo's communication steps still involve all-reducing a complete copy of the model's parameters. While existing works have explored way…

    Submitted 29 May, 2025; originally announced May 2025.

  3. arXiv:2504.12463

    cs.LG cs.AI

    Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

    Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein

    Abstract: Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update w…

    Submitted 17 April, 2025; v1 submitted 16 April, 2025; originally announced April 2025.

  4. arXiv:2503.05029

    cs.LG cs.AI cs.CL

    Continual Pre-training of MoEs: How robust is your router?

    Authors: Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, Irina Rish

    Abstract: Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopte…

    Submitted 6 March, 2025; originally announced March 2025.

  5. arXiv:2503.02844

    cs.LG

    Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

    Authors: Vaibhav Singh, Paul Janson, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, Benjamin Thérien

    Abstract: The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams withou…

    Submitted 5 March, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

  6. arXiv:2406.00153

    cs.LG

    $μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

    Authors: Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky

    Abstract: Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($μ$P) for two state-of-the-art learned optimizer a…

    Submitted 4 June, 2025; v1 submitted 31 May, 2024; originally announced June 2024.

  7. arXiv:2403.08763

    cs.LG cs.AI cs.CL

    Simple and Scalable Strategies to Continually Pre-train Large Language Models

    Authors: Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish

    Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptati…

    Submitted 4 September, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  8. arXiv:2312.02204

    cs.LG

    Meta-learning Optimizers for Communication-Efficient Learning

    Authors: Charles-Étienne Joseph, Benjamin Thérien, Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky

    Abstract: Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can someti…

    Submitted 11 June, 2025; v1 submitted 2 December, 2023; originally announced December 2023.

  9. arXiv:2308.04014

    cs.CL cs.LG

    Continual Pre-Training of Large Language Models: How to (re)warm your model?

    Authors: Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, Timothée Lesort

    Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data t…

    Submitted 6 September, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

  10. arXiv:2305.10210

    cs.CV cs.LG

    Object Re-Identification from Point Clouds

    Authors: Benjamin Thérien, Chengjie Huang, Adrian Chow, Krzysztof Czarnecki

    Abstract: Object re-identification (ReID) from images plays a critical role in application domains of image retrieval (surveillance, retail analytics, etc.) and multi-object tracking (autonomous driving, robotics, etc.). However, systems that additionally or exclusively perceive the world from depth sensors are becoming more commonplace without any corresponding methods for object ReID. In this work, we fil…

    Submitted 11 August, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

  11. arXiv:2210.02577

    cs.LG cs.CR

    A Closer Look at Robustness to L-infinity and Spatial Perturbations and their Composition

    Authors: Luke Rowe, Benjamin Thérien, Krzysztof Czarnecki, Hongyang Zhang

    Abstract: In adversarial machine learning, the popular $\ell_\infty$ threat model has been the focus of much previous work. While this mathematical definition of imperceptibility successfully captures an infinite set of additive image transformations that a model should be robust to, this is only a subset of all transformations which leave the semantic label of an image unchanged. Indeed, previous work also…

    Submitted 5 October, 2022; originally announced October 2022.

    Comments: 16 pages, 5 figures, and 3 tables

  12. arXiv:2210.01266

    cs.CV cs.AI cs.LG

    Interpretable Deep Tracking

    Authors: Benjamin Thérien, Krzysztof Czarnecki

    Abstract: Imagine experiencing a crash as the passenger of an autonomous vehicle. Wouldn't you want to know why it happened? Current end-to-end optimizable deep neural networks (DNNs) in 3D detection, multi-object tracking, and motion forecasting provide little to no explanations about how they make their decisions. To help bridge this gap, we design an end-to-end optimizable multi-object tracking architect…

    Submitted 3 October, 2022; originally announced October 2022.

  13. arXiv:2209.14435

    cs.CV cs.AI cs.LG cs.RO eess.IV

    Out-of-Distribution Detection for LiDAR-based 3D Object Detection

    Authors: Chengjie Huang, Van Duong Nguyen, Vahdat Abdelzad, Christopher Gus Mannes, Luke Rowe, Benjamin Therien, Rick Salay, Krzysztof Czarnecki

    Abstract: 3D object detection is an essential part of automated driving, and deep neural networks (DNNs) have achieved state-of-the-art performance for this task. However, deep models are notorious for assigning high confidence scores to out-of-distribution (OOD) inputs, that is, inputs that are not drawn from the training distribution. Detecting OOD inputs is challenging and essential for the safe deployme…

    Submitted 28 September, 2022; originally announced September 2022.

    Comments: Accepted at ITSC 2022

  14. arXiv:2107.09539

    cs.LG eess.SP

    Parametric Scattering Networks

    Authors: Shanel Gauthier, Benjamin Thérien, Laurent Alsène-Racicot, Muawiz Chaudhary, Irina Rish, Eugene Belilovsky, Michael Eickenberg, Guy Wolf

    Abstract: The wavelet scattering transform creates geometric invariants and deformation stability. In multiple signal domains, it has been shown to yield more discriminative representations compared to other non-learned representations and to outperform learned representations in certain tasks, particularly on limited labeled data and highly structured signals. The wavelet filters used in the scattering tra…

    Submitted 15 August, 2022; v1 submitted 20 July, 2021; originally announced July 2021.

    ACM Class: F.2.2; I.2.7