Showing 1–23 of 23 results for author: Ramapuram, J

  1. arXiv:2502.08606  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Distillation Scaling Laws

    Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb

    Abstract: We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1…

    Submitted 12 February, 2025; originally announced February 2025.

    Comments: 67 pages, 54 figures, 13 tables

  2. arXiv:2409.04431  [pdf, other]

    cs.LG

    Theory, Analysis, and Best Practices for Sigmoid Self-Attention

    Authors: Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb

    Abstract: Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoi…

    Submitted 21 January, 2025; v1 submitted 6 September, 2024; originally announced September 2024.

  3. arXiv:2403.05490  [pdf, other]

    cs.LG cs.AI cs.CV cs.IT stat.ML

    Poly-View Contrastive Learning

    Authors: Amitis Shidani, Devon Hjelm, Jason Ramapuram, Russ Webb, Eeshan Gunesh Dhekane, Dan Busbridge

    Abstract: Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimit…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted to ICLR 2024. 42 pages, 7 figures, 3 tables, loss pseudo-code included in appendix

  4. arXiv:2312.03213  [pdf, other]

    cs.LG stat.ML

    Bootstrap Your Own Variance

    Authors: Polina Turishcheva, Jason Ramapuram, Sinead Williamson, Dan Busbridge, Eeshan Dhekane, Russ Webb

    Abstract: Understanding model uncertainty is important for many applications. We propose Bootstrap Your Own Variance (BYOV), combining Bootstrap Your Own Latent (BYOL), a negative-free Self-Supervised Learning (SSL) algorithm, with Bayes by Backprop (BBB), a Bayesian method for estimating model posteriors. We find that the learned predictive std of BYOV vs. a supervised BBB model is well captured by a Gauss…

    Submitted 5 December, 2023; originally announced December 2023.

    Journal ref: NeurIPS 2023 Workshop: Self-Supervised Learning - Theory and Practice

  5. arXiv:2307.13813  [pdf, other]

    stat.ML cs.AI cs.LG

    How to Scale Your EMA

    Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau, Russ Webb

    Abstract: Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functio…

    Submitted 7 November, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: Spotlight at NeurIPS 2023, 53 pages, 32 figures, 17 tables

  6. arXiv:2307.10907  [pdf, other]

    cs.LG

    The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning

    Authors: Borja Rodríguez-Gálvez, Arno Blaas, Pau Rodríguez, Adam Goliński, Xavier Suau, Jason Ramapuram, Dan Busbridge, Luca Zappella

    Abstract: The mechanisms behind the success of multi-view self-supervised learning (MVSSL) are not yet fully understood. Contrastive MVSSL methods have been studied through the lens of InfoNCE, a lower bound of the Mutual Information (MI). However, the relation between other MVSSL methods and MI remains unclear. We consider a different lower bound on the MI consisting of an entropy and a reconstruction term…

    Submitted 9 December, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: 18 pages: 9 of main text, 2 of references, and 7 of supplementary material [Updated typo in page 6 (Section 3.2)]. Appears in the proceedings of ICML 2023

  7. arXiv:2306.16058  [pdf, other]

    cs.LG cs.AI

    DUET: 2D Structured and Approximately Equivariant Representations

    Authors: Xavier Suau, Federico Danieli, T. Anderson Keller, Arno Blaas, Chen Huang, Jason Ramapuram, Dan Busbridge, Luca Zappella

    Abstract: Multiview Self-Supervised Learning (MSSL) is based on learning invariances with respect to a set of input transformations. However, invariance partially or totally removes transformation-related information from the representations, which might harm performance for specific downstream tasks that require such information. We propose 2D strUctured and EquivarianT representations (coined DUET), which…

    Submitted 17 November, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: Accepted at ICML 2023

  8. arXiv:2303.06296  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV stat.ML

    Stabilizing Transformer Training by Preventing Attention Entropy Collapse

    Authors: Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind

    Abstract: Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low at…

    Submitted 25 July, 2023; v1 submitted 10 March, 2023; originally announced March 2023.

    Journal ref: In International Conference on Machine Learning (pp. 40770-40803). PMLR. 2023

  9. arXiv:2210.16365  [pdf, other]

    cs.LG

    Elastic Weight Consolidation Improves the Robustness of Self-Supervised Learning Methods under Transfer

    Authors: Andrius Ovsianas, Jason Ramapuram, Dan Busbridge, Eeshan Gunesh Dhekane, Russ Webb

    Abstract: Self-supervised representation learning (SSL) methods provide an effective label-free initial condition for fine-tuning downstream tasks. However, in numerous realistic scenarios, the downstream task might be biased with respect to the target label distribution. This in turn moves the learned fine-tuned model posterior away from the initial (label) bias-free self-supervised model posterior. In thi…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022 Workshop: Self-Supervised Learning - Theory and Practice

  10. arXiv:2207.07611  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    Position Prediction as an Effective Pretraining Strategy

    Authors: Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Yitan Cheng, Walter Talbott, Chen Huang, Hanlin Goh, Joshua Susskind

    Abstract: Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Tr…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: Accepted to ICML 2022

  11. arXiv:2111.12427  [pdf, other]

    cs.LG cs.CV

    Challenges of Adversarial Image Augmentations

    Authors: Arno Blaas, Xavier Suau, Jason Ramapuram, Nicholas Apostoloff, Luca Zappella

    Abstract: Image augmentations applied during training are crucial for the generalization performance of image classifiers. Therefore, a large body of research has focused on finding the optimal augmentation policy for a given task. Yet, RandAugment [2], a simple random augmentation policy, has recently been shown to outperform existing sophisticated policies. Only Adversarial AutoAugment (AdvAA) [11], an ap…

    Submitted 3 December, 2021; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: To appear at the ICBINB 2021 NeurIPS Workshop

  12. arXiv:2110.00552  [pdf, other]

    cs.LG

    Stochastic Contrastive Learning

    Authors: Jason Ramapuram, Dan Busbridge, Xavier Suau, Russ Webb

    Abstract: While state-of-the-art contrastive Self-Supervised Learning (SSL) models produce results competitive with their supervised counterparts, they lack the ability to infer latent variables. In contrast, prescribed latent variable (LV) models enable attributing uncertainty, inducing task specific compression, and in general allow for more interpretable representations. In this work, we introduce LV app…

    Submitted 30 November, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

    Comments: Accepted to 2nd Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2021), Sydney, Australia

  13. arXiv:2110.00538  [pdf, other]

    cs.LG

    Evaluating the fairness of fine-tuning strategies in self-supervised learning

    Authors: Jason Ramapuram, Dan Busbridge, Russ Webb

    Abstract: In this work we examine how fine-tuning impacts the fairness of contrastive Self-Supervised Learning (SSL) models. Our findings indicate that Batch Normalization (BN) statistics play a crucial role, and that updating only the BN statistics of a pre-trained SSL backbone improves its downstream fairness (36% worst subgroup, 25% mean subgroup gap). This procedure is competitive with supervised learni…

    Submitted 1 October, 2021; originally announced October 2021.

    Comments: Accepted to BayLearn 2021

  14. arXiv:2110.00528  [pdf, other]

    cs.CV cs.LG stat.ML

    Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?

    Authors: Tom George Grigg, Dan Busbridge, Jason Ramapuram, Russ Webb

    Abstract: Despite the success of a number of recent techniques for visual self-supervised deep learning, there has been limited investigation into the representations that are ultimately learned. By leveraging recent advances in the comparison of neural representations, we explore in this direction by comparing a contrastive self-supervised algorithm to supervision for simple image data in a common architec…

    Submitted 2 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

    Comments: Accepted to 2nd Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2021), Sydney, Australia. Fixed typos, added acknowledgements. 5 pages + 2 pages of appendices, 5 figures, 1 table

  15. arXiv:2103.03905  [pdf, other]

    cs.NE cs.AI cs.CV cs.LG stat.ML

    Kanerva++: extending The Kanerva Machine with differentiable, locally block allocated latent memory

    Authors: Jason Ramapuram, Yan Wu, Alexandros Kalousis

    Abstract: Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work we develop a new principled Bayesian memory allocat…

    Submitted 6 February, 2022; v1 submitted 20 February, 2021; originally announced March 2021.

    Journal ref: ICLR 2021

  16. arXiv:2011.02523  [pdf, other]

    cs.CV cs.GR

    Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding

    Authors: Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, Joshua M. Susskind

    Abstract: For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. We address this challenge by introducing Hypersim, a photorealistic synthetic dataset for holistic indoor scene understanding. To create our dataset, we leverage a large repository of synthetic scenes created by professional artists, and we generate 77,400 images…

    Submitted 17 August, 2021; v1 submitted 4 November, 2020; originally announced November 2020.

    Comments: Accepted for publication at the International Conference on Computer Vision (ICCV) 2021

  17. arXiv:2006.16228  [pdf, other]

    cs.CV

    Self-Supervised MultiModal Versatile Networks

    Authors: Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

    Abstract: Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalit…

    Submitted 30 October, 2020; v1 submitted 29 June, 2020; originally announced June 2020.

    Comments: To appear in the Thirty-Fourth Annual Conference on Neural Information Processing Systems (NeurIPS 2020)

  18. arXiv:1905.03658  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Improving Discrete Latent Representations With Differentiable Approximation Bridges

    Authors: Jason Ramapuram, Russ Webb

    Abstract: Modern neural network training relies on piece-wise (sub-)differentiable functions in order to use backpropagation to update model parameters. In this work, we introduce a novel method to allow simple non-differentiable functions at intermediary layers of deep neural networks. We do so by training with a differentiable approximation bridge (DAB) neural network which approximates the non-differenti…

    Submitted 25 October, 2019; v1 submitted 9 May, 2019; originally announced May 2019.

  19. arXiv:1812.03170  [pdf, other]

    cs.CV cs.LG stat.ML

    Variational Saccading: Efficient Inference for Large Resolution Images

    Authors: Jason Ramapuram, Maurits Diephuis, Frantzeska Lavda, Russ Webb, Alexandros Kalousis

    Abstract: Image classification with deep neural networks is typically restricted to images of small dimensionality such as 224 x 224 in ResNet models [24]. This limitation excludes the 4000 x 3000 dimensional images that are taken by modern smartphone cameras and smart devices. In this work, we aim to mitigate the prohibitive inferential and memory costs of operating in such large dimensional spaces. To sam…

    Submitted 6 September, 2019; v1 submitted 8 December, 2018; originally announced December 2018.

    Comments: Published BMVC 2019 & NIPS 2018 Bayesian Deep Learning Workshop

  20. arXiv:1810.10612  [pdf, other]

    cs.LG stat.ML

    Continual Classification Learning Using Generative Models

    Authors: Frantzeska Lavda, Jason Ramapuram, Magda Gregorova, Alexandros Kalousis

    Abstract: Continual learning is the ability to sequentially learn over time by accommodating knowledge while retaining previously learned experiences. Neural networks can learn multiple tasks when trained on them jointly, but cannot maintain performance on previously learned tasks when tasks are presented one at a time. This problem is called catastrophic forgetting. In this work, we propose a classificatio…

    Submitted 24 October, 2018; originally announced October 2018.

    Comments: 5 pages, 4 figures, under review in Continual learning Workshop NIPS 2018

  21. arXiv:1807.00126  [pdf, other]

    cs.LG stat.ML

    A New Benchmark and Progress Toward Improved Weakly Supervised Learning

    Authors: Jason Ramapuram, Russ Webb

    Abstract: Knowledge Matters: Importance of Prior Information for Optimization [7], by Gulcehre et al., sought to establish the limits of current black-box, deep learning techniques by posing problems which are difficult to learn without engineering knowledge into the model or training procedure. In our work, we completely solve the previous Knowledge Matters problem using a generic model, pose a more diffi…

    Submitted 18 September, 2018; v1 submitted 30 June, 2018; originally announced July 2018.

  22. arXiv:1804.07169  [pdf, ps, other]

    cs.LG stat.ML

    Large-scale Nonlinear Variable Selection via Kernel Random Features

    Authors: Magda Gregorová, Jason Ramapuram, Alexandros Kalousis, Stéphane Marchand-Maillet

    Abstract: We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, not being a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs…

    Submitted 1 September, 2018; v1 submitted 19 April, 2018; originally announced April 2018.

    Comments: Final version for proceedings of ECML/PKDD 2018

  23. Lifelong Generative Modeling

    Authors: Jason Ramapuram, Magda Gregorova, Alexandros Kalousis

    Abstract: Lifelong learning is the problem of learning multiple consecutive tasks in a sequential manner, where knowledge gained from previous tasks is retained and used to aid future learning over the lifetime of the learner. It is essential towards the development of intelligent machines that can adapt to their surroundings. In this work we focus on a lifelong learning approach to unsupervised generative…

    Submitted 8 September, 2020; v1 submitted 27 May, 2017; originally announced May 2017.

    Comments: 32 pages

    Journal ref: Neurocomputing 2020, Volume 404, Pages 381-400