Showing 1–18 of 18 results for author: Vladymyrov, M

  1. arXiv:2504.09522  [pdf, other]

    cs.CL cs.AI

    How new data permeates LLM knowledge and how to dilute it

    Authors: Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler

    Abstract: Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We demonstrate that when learning new information, LLMs exhibit a "priming" effect: learning a new fact can cause the model to…

    Submitted 13 April, 2025; originally announced April 2025.

  2. arXiv:2504.08934  [pdf, other]

    cs.LG cs.AI

    Long Context In-Context Compression by Getting to the Gist of Gisting

    Authors: Aleksandar Petrov, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov

    Abstract: Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression method with no architectural modification to the decoder transformer, is a promising approach due to its simplicity and compatibility with existing frameworks. While effective for short instructions, we…

    Submitted 11 April, 2025; originally announced April 2025.

  3. arXiv:2410.21750  [pdf, other]

    cs.CL cs.AI

    Learning and Unlearning of Fabricated Knowledge in Language Models

    Authors: Chen Sun, Nolan Andrew Miller, Andrey Zhmoginov, Max Vladymyrov, Mark Sandler

    Abstract: What happens when a new piece of knowledge is introduced into the training data and how long does it last while a large language model (LM) continues to train? We investigate this question by injecting facts into LMs from a new probing dataset, "Outlandish", which is designed to permit the testing of a spectrum of different fact types. When studying how robust these memories are, there appears to…

    Submitted 29 October, 2024; originally announced October 2024.

    Journal ref: ICML 2024 Workshop on Mechanistic Interpretability

  4. arXiv:2408.09310  [pdf, other]

    cs.LG

    Narrowing the Focus: Learned Optimizers for Pretrained Models

    Authors: Gus Kristiansen, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Anirudh Goyal, Jihwan Lee, Max Vladymyrov

    Abstract: In modern deep learning, the models are learned by applying gradient updates using an optimizer, which transforms the updates based on various statistics. Optimizers are often hand-designed and tuning their hyperparameters is a big part of the training process. Learned optimizers have shown some initial promise, but are generally unsuccessful as a general optimization mechanism applicable to every…

    Submitted 4 October, 2024; v1 submitted 17 August, 2024; originally announced August 2024.

  5. arXiv:2402.14180  [pdf, other]

    cs.LG

    Linear Transformers are Versatile In-Context Learners

    Authors: Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

    Abstract: Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in handling more complex problems remains unexplored. In this paper, we prove that each layer of a linear transformer maintains a weight vector for an implicit linear…

    Submitted 30 October, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

  6. arXiv:2309.05858  [pdf, other]

    cs.LG cs.AI

    Uncovering mesa-optimization algorithms in Transformers

    Authors: Johannes von Oswald, Maximilian Schlegel, Alexander Meulemans, Seijin Kobayashi, Eyvind Niklasson, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento

    Abstract: Some autoregressive models exhibit in-context learning capabilities: being able to learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so. The origins of this phenomenon are still poorly understood. Here we analyze a series of Transformer models trained to perform synthetic sequence prediction tasks, and discover that standa…

    Submitted 15 October, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

  7. arXiv:2301.04584  [pdf, other]

    cs.LG cs.CV

    Continual HyperTransformer: A Meta-Learner for Continual Few-Shot Learning

    Authors: Max Vladymyrov, Andrey Zhmoginov, Mark Sandler

    Abstract: We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates specialized task-specific CNN weights directly from the support set. In order to learn from…

    Submitted 17 August, 2024; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: TMLR

  8. arXiv:2301.02312  [pdf, other]

    cs.LG

    Training trajectories, mini-batch losses and the curious role of the learning rate

    Authors: Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Nolan Miller

    Abstract: Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However its ability to converge to a global minimum remains shrouded in mystery. In this paper we propose to study the behavior of the loss function on fixed mini-batches along SGD trajectories. We show that the loss function on a fixed batch appears to be remarkably convex-like. In particular for Res…

    Submitted 1 February, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: 21 pages, 14 figures

  9. arXiv:2212.07677  [pdf, other]

    cs.LG cs.AI cs.CL

    Transformers learn in-context by gradient descent

    Authors: Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

    Abstract: At present, the mechanisms of in-context learning in Transformers are not well understood and remain mostly an intuition. In this paper, we suggest that training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linea…

    Submitted 31 May, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

  10. arXiv:2211.15774  [pdf, other]

    cs.LG cs.CV

    Decentralized Learning with Multi-Headed Distillation

    Authors: Andrey Zhmoginov, Mark Sandler, Nolan Miller, Gus Kristiansen, Max Vladymyrov

    Abstract: Decentralized learning with private data is a central problem in machine learning. We propose a novel distillation-based decentralized learning technique that allows multiple agents with private non-iid data to learn from each other, without having to share their data, weights or weight updates. Our approach is communication efficient, utilizes an unlabeled public dataset and uses multiple auxilia…

    Submitted 28 November, 2022; originally announced November 2022.

  11. arXiv:2203.15243  [pdf, other]

    cs.CV

    Fine-tuning Image Transformers using Learnable Memory

    Authors: Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson

    Abstract: In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks. At each layer we introduce a set of learnable embedding vectors that provide contextual information useful for specific datasets. We call these "memory tokens"…

    Submitted 29 March, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: CVPR 2022, to appear

  12. arXiv:2201.05125  [pdf, other]

    cs.LG cs.CV

    GradMax: Growing Neural Networks using Gradient Information

    Authors: Utku Evci, Bart van Merriënboer, Thomas Unterthiner, Max Vladymyrov, Fabian Pedregosa

    Abstract: The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the trai…

    Submitted 7 June, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

    Comments: ICLR 2022

    Journal ref: International Conference on Learning Representations, 2022

  13. arXiv:2201.04182  [pdf, other]

    cs.LG cs.CV

    HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning

    Authors: Andrey Zhmoginov, Mark Sandler, Max Vladymyrov

    Abstract: In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity Transformer model, we effectively decouple the complexity of the large task space…

    Submitted 13 July, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

  14. arXiv:2104.04657  [pdf, other]

    cs.LG cs.NE

    Meta-Learning Bidirectional Update Rules

    Authors: Mark Sandler, Max Vladymyrov, Andrey Zhmoginov, Nolan Miller, Andrew Jackson, Tom Madams, Blaise Aguera y Arcas

    Abstract: In this paper, we introduce a new type of generalized neural network where neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks…

    Submitted 11 June, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: ICML 2021, 17 pages

  15. arXiv:2011.03395  [pdf, other]

    cs.LG stat.ML

    Underspecification Presents Challenges for Credibility in Modern Machine Learning

    Authors: Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, et al. (15 additional authors not shown)

    Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predict…

    Submitted 24 November, 2020; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: Updates: Updated statistical analysis in Section 6; Additional citations

  16. Novel tracking approach based on fully-unsupervised disentanglement of the geometrical factors of variation

    Authors: Mykhailo Vladymyrov, Akitaka Ariga

    Abstract: Efficient tracking algorithms are a crucial part of particle tracking detectors. While a lot of work has been done in designing a plethora of algorithms, these usually require tedious tuning for each use case. (Weakly) supervised Machine Learning-based approaches can leverage the actual raw data for maximal performance. Yet in realistic scenarios, sufficient high-quality labeled data is not availa…

    Submitted 13 February, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

    Comments: Accepted for publication in JINST

  17. arXiv:1906.11389  [pdf, ps, other]

    cs.LG stat.ML

    No Pressure! Addressing the Problem of Local Minima in Manifold Learning Algorithms

    Authors: Max Vladymyrov

    Abstract: Nonlinear embedding manifold learning methods provide invaluable visual insights into the structure of high-dimensional data. However, due to a complicated nonconvex objective function, these methods can easily get stuck in local minima and their embedding quality can be poor. We propose a natural extension to several manifold learning methods aimed at identifying pressured points, i.e. points stu…

    Submitted 27 December, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

    Comments: 10 pages, NeurIPS 2019

  18. arXiv:1206.4646  [pdf]

    cs.LG stat.ML

    Partial-Hessian Strategies for Fast Learning of Nonlinear Embeddings

    Authors: Max Vladymyrov, Miguel Carreira-Perpinan

    Abstract: Stochastic neighbor embedding (SNE) and related nonlinear manifold learning algorithms achieve high-quality low-dimensional representations of similarity data, but are notoriously slow to train. We propose a generic formulation of embedding algorithms that includes SNE and other existing algorithms, and study their relation with spectral methods and graph Laplacians. This allows us to define sever…

    Submitted 18 June, 2012; originally announced June 2012.

    Comments: ICML 2012