-
Steering Large Language Model Activations in Sparse Spaces
Authors:
Reza Bayat,
Ali Rahimi-Kalahroudi,
Mohammad Pezeshki,
Sarath Chandar,
Pascal Vincent
Abstract:
A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contr…
▽ More
A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations provide an untapped opportunity for more interpretable behavior modulation. In this work, we introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer LLM behavior in sparse spaces. By isolating behavior-specific features through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable nuanced behavioral modulation and finer-grained control. Furthermore, scaling SAEs improves monosemanticity of SAS vectors, suggesting more reliable and interpretable interventions.
△ Less
Submitted 28 February, 2025;
originally announced March 2025.
-
MaestroMotif: Skill Design from Artificial Intelligence Feedback
Authors:
Martin Klissarov,
Mikael Henaff,
Roberta Raileanu,
Shagun Sodhani,
Pascal Vincent,
Amy Zhang,
Pierre-Luc Bacon,
Doina Precup,
Marlos C. Machado,
Pierluca D'Oro
Abstract:
Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM'…
▽ More
Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.
△ Less
Submitted 11 December, 2024;
originally announced December 2024.
-
The Pitfalls of Memorization: When Memorization Hurts Generalization
Authors:
Reza Bayat,
Mohammad Pezeshki,
Elvis Dohmatob,
David Lopez-Paz,
Pascal Vincent
Abstract:
Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations.This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations would particularly lead to poor…
▽ More
Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations.This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations would particularly lead to poor generalization when are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
Compositional Risk Minimization
Authors:
Divyat Mahajan,
Mohammad Pezeshki,
Charles Arnal,
Ioannis Mitliagkas,
Kartik Ahuja,
Pascal Vincent
Abstract:
Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize composition…
▽ More
Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirms the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
△ Less
Submitted 10 February, 2025; v1 submitted 8 October, 2024;
originally announced October 2024.
-
Cache Blocking for Flux Reconstruction: Extension to Navier-Stokes Equations and Anti-aliasing
Authors:
Semih Akkurt,
Freddie Witherden,
Peter Vincent
Abstract:
In this article, cache blocking is implemented for the Navier Stokes equations with anti-aliasing support on mixed grids in PyFR for CPUs. In particular, cache blocking is used as an alternative to kernel fusion to eliminate unnecessary data movements between kernels at the main memory level. Specifically, kernels that exchange data are grouped together, and these groups are then executed on small…
▽ More
In this article, cache blocking is implemented for the Navier Stokes equations with anti-aliasing support on mixed grids in PyFR for CPUs. In particular, cache blocking is used as an alternative to kernel fusion to eliminate unnecessary data movements between kernels at the main memory level. Specifically, kernels that exchange data are grouped together, and these groups are then executed on small sub-regions of the domain that fit in per-core private data cache. Additionally, cache blocking is also used to efficiently implement a tensor product factorisation of the interpolation operators associated with anti-aliasing. By using cache blocking, the intermediate results between application of the sparse factors are stored in per-core private data cache, and a significant amount of data movement from main memory is avoided. In order to assess the performance gains a theoretical model is developed, and the implementation is benchmarked using a compressible 3D Taylor-Green vortex test case on both hexahedral and prismatic grids, with third- and forth-order solution polynomials. The expected performance gains based on the theoretical model range from 1.99 to 2.62, and the speedups obtained in practice range from 1.67 to 3.67 compared to PyFR v1.11.0.
△ Less
Submitted 6 November, 2023;
originally announced January 2024.
-
WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models
Authors:
Youssef Benchekroun,
Megi Dervishi,
Mark Ibrahim,
Jean-Baptiste Gaya,
Xavier Martinet,
Grégoire Mialon,
Thomas Scialom,
Emmanuel Dupoux,
Dieuwke Hupkes,
Pascal Vincent
Abstract:
We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. Worldsense is a synthetic benchmark with three problem types, each with their own trivial control, which explicitly avoids bias by decorrelating the abstract structure of…
▽ More
We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. Worldsense is a synthetic benchmark with three problem types, each with their own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constraint problem space.
△ Less
Submitted 27 November, 2023;
originally announced November 2023.
-
Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations
Authors:
Cian Eastwood,
Julius von Kügelgen,
Linus Ericsson,
Diane Bouchacourt,
Pascal Vincent,
Bernhard Schölkopf,
Mark Ibrahim
Abstract:
Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To deal with this, current approaches try to retain some style information by tuning the d…
▽ More
Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To deal with this, current approaches try to retain some style information by tuning the degree of invariance to some particular task, such as ImageNet object classification. However, prior work has shown that such task-specific tuning can lead to significant performance degradation on other tasks that rely on the discarded style. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and individual style variables. We empirically demonstrate the benefits of our approach on both synthetic and real-world data.
△ Less
Submitted 20 August, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
Authors:
Martin Klissarov,
Pierluca D'Oro,
Shagun Sodhani,
Roberta Raileanu,
Pierre-Luc Bacon,
Pascal Vincent,
Amy Zhang,
Mikael Henaff
Abstract:
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM ove…
▽ More
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.
△ Less
Submitted 29 September, 2023;
originally announced October 2023.
-
Discovering environments with XRM
Authors:
Mohammad Pezeshki,
Diane Bouchacourt,
Mark Ibrahim,
Nicolas Ballas,
Pascal Vincent,
David Lopez-Paz
Abstract:
Environment annotations are essential for the success of many out-of-distribution (OOD) generalization methods. Unfortunately, these are costly to obtain and often limited by human annotators' biases. To achieve robust generalization, it is essential to develop algorithms for automatic environment discovery within datasets. Current proposals, which divide examples based on their training error, su…
▽ More
Environment annotations are essential for the success of many out-of-distribution (OOD) generalization methods. Unfortunately, these are costly to obtain and often limited by human annotators' biases. To achieve robust generalization, it is essential to develop algorithms for automatic environment discovery within datasets. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods introduce hyper-parameters and early-stopping criteria, which require a validation set with human-annotated environments, the very information subject to discovery. In this paper, we propose Cross-Risk-Minimization (XRM) to address this issue. XRM trains twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Algorithms built on top of XRM environments achieve oracle worst-group-accuracy, addressing a long-standing challenge in OOD generalization. Code available at \url{https://github.com/facebookresearch/XRM}.
△ Less
Submitted 19 July, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
Authors:
Florian Bordes,
Shashank Shekhar,
Mark Ibrahim,
Diane Bouchacourt,
Pascal Vincent,
Ari S. Morcos
Abstract:
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite…
▽ More
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. In this paper, we demonstrate the potential of PUG to enable more rigorous evaluations of vision models.
△ Less
Submitted 12 December, 2023; v1 submitted 7 August, 2023;
originally announced August 2023.
-
Stochastic positional embeddings improve masked image modeling
Authors:
Amir Bar,
Florian Bordes,
Assaf Shocher,
Mahmoud Assran,
Pascal Vincent,
Nicolas Ballas,
Trevor Darrell,
Amir Globerson,
Yann LeCun
Abstract:
Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determi…
▽ More
Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, StoP improves downstream MIM performance on a variety of downstream tasks, including $+1.7\%$ on ImageNet linear probing using ViT-B, and $+2.5\%$ for ViT-H using $1\%$ of the data.
△ Less
Submitted 27 February, 2024; v1 submitted 31 July, 2023;
originally announced August 2023.
-
On the Identifiability of Quantized Factors
Authors:
Vitória Barin-Pacela,
Kartik Ahuja,
Simon Lacoste-Julien,
Pascal Vincent
Abstract:
Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution solely, and is formalized through the theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to…
▽ More
Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution solely, and is formalized through the theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.
△ Less
Submitted 12 March, 2024; v1 submitted 28 June, 2023;
originally announced June 2023.
-
Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning
Authors:
Casey Meehan,
Florian Bordes,
Pascal Vincent,
Kamalika Chaudhuri,
Chuan Guo
Abstract:
Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintendedly memorize specific parts in individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended…
▽ More
Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintendedly memorize specific parts in individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as déjà vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that déjà vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of déjà vu memorization reveals previously unknown privacy risks in SSL models, as well as suggests potential practical mitigation strategies. Code is available at https://github.com/facebookresearch/DejaVu.
△ Less
Submitted 12 December, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations
Authors:
Shashank Shekhar,
Florian Bordes,
Pascal Vincent,
Ari Morcos
Abstract:
Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of the learned…
▽ More
Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of the learned representations. Our analysis reveals that reconstruction-based learning features are significantly dissimilar to joint-embedding based learning features and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network and are primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear probe transfer for classification because the different objectives drive different distributions of information and invariances in the learned representation. These differences explain opposite trends in transfer performance for downstream tasks that require spatial specificity in features. Finally, we address how fine-tuning changes reconstructive representations to enable better transfer, showing that fine-tuning re-organizes the information to be more similar to pre-trained joint embedding models.
△ Less
Submitted 25 April, 2023;
originally announced April 2023.
-
A surprisingly simple technique to control the pretraining bias for better transfer: Expand or Narrow your representation
Authors:
Florian Bordes,
Samuel Lavoie,
Randall Balestriero,
Nicolas Ballas,
Pascal Vincent
Abstract:
Self-Supervised Learning (SSL) models rely on a pretext task to learn representations. Because this pretext task differs from the downstream tasks used to evaluate the performance of these models, there is an inherent misalignment or pretraining bias. A commonly used trick in SSL, shown to make deep networks more robust to such bias, is the addition of a small projector (usually a 2 or 3 layer mul…
▽ More
Self-Supervised Learning (SSL) models rely on a pretext task to learn representations. Because this pretext task differs from the downstream tasks used to evaluate the performance of these models, there is an inherent misalignment or pretraining bias. A commonly used trick in SSL, shown to make deep networks more robust to such bias, is the addition of a small projector (usually a 2 or 3 layer multi-layer perceptron) on top of a backbone network during training. In contrast to previous work that studied the impact of the projector architecture, we here focus on a simpler, yet overlooked lever to control the information in the backbone representation. We show that merely changing its dimensionality -- by changing only the size of the backbone's very last block -- is a remarkably effective technique to mitigate the pretraining bias. It significantly improves downstream transfer performance for both Self-Supervised and Supervised pretrained models.
△ Less
Submitted 11 April, 2023;
originally announced April 2023.
-
Instance-Conditioned GAN Data Augmentation for Representation Learning
Authors:
Pietro Astolfi,
Arantxa Casanova,
Jakob Verbeek,
Pascal Vincent,
Adriana Romero-Soriano,
Michal Drozdzal
Abstract:
Data augmentation has become a crucial component to train state-of-the-art visual representation models. However, handcrafting combinations of transformations that lead to improved performances is a laborious task, which can result in visually unrealistic samples. To overcome these limitations, recent works have explored the use of generative models as learnable data augmentation tools, showing pr…
▽ More
Data augmentation has become a crucial component to train state-of-the-art visual representation models. However, handcrafting combinations of transformations that lead to improved performances is a laborious task, which can result in visually unrealistic samples. To overcome these limitations, recent works have explored the use of generative models as learnable data augmentation tools, showing promising results in narrow application domains, e.g., few-shot learning and low-data medical imaging. In this paper, we introduce a data augmentation module, called DA_IC-GAN, which leverages instance-conditioned GAN generations and can be used off-the-shelf in conjunction with most state-of-the-art training recipes. We showcase the benefits of DA_IC-GAN by plugging it out-of-the-box into the supervised training of ResNets and DeiT models on the ImageNet dataset, and achieving accuracy boosts up to between 1%p and 2%p with the highest capacity models. Moreover, the learnt representations are shown to be more robust than the baselines when transferred to a handful of out-of-distribution datasets, and exhibit increased invariance to variations of instance and viewpoints. We additionally couple DA_IC-GAN with a self-supervised training recipe and show that we can also achieve an improvement of 1%p in accuracy in some settings. With this work, we strengthen the evidence on the potential of learnable data augmentations to improve visual representation learning, paving the road towards non-handcrafted augmentations in model training.
△ Less
Submitted 16 March, 2023;
originally announced March 2023.
-
Towards Democratizing Joint-Embedding Self-Supervised Learning
Authors:
Florian Bordes,
Randall Balestriero,
Pascal Vincent
Abstract:
Joint Embedding Self-Supervised Learning (JE-SSL) has seen rapid developments in recent years, due to its promise to effectively leverage large unlabeled data. The development of JE-SSL methods was driven primarily by the search for ever increasing downstream classification accuracies, using huge computational resources, and typically built upon insights and intuitions inherited from a close paren…
▽ More
Joint Embedding Self-Supervised Learning (JE-SSL) has seen rapid developments in recent years, due to its promise to effectively leverage large unlabeled data. The development of JE-SSL methods was driven primarily by the search for ever increasing downstream classification accuracies, using huge computational resources, and typically built upon insights and intuitions inherited from a close parent JE-SSL method. This has led unwittingly to numerous pre-conceived ideas that carried over across methods e.g. that SimCLR requires very large mini batches to yield competitive accuracies; that strong and computationally slow data augmentations are required. In this work, we debunk several such ill-formed a priori ideas in the hope to unleash the full potential of JE-SSL free of unnecessary limitations. In fact, when carefully evaluating performances across different downstream tasks and properly optimizing hyper-parameters of the methods, we most often -- if not always -- see that these widespread misconceptions do not hold. For example we show that it is possible to train SimCLR to learn useful representations, while using a single image patch as negative example, and simple Gaussian noise as the only data augmentation for the positive pair. Along these lines, in the hope to democratize JE-SSL and to allow researchers to easily make more extensive evaluations of their methods, we introduce an optimized PyTorch library for SSL.
△ Less
Submitted 3 March, 2023;
originally announced March 2023.
-
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Authors:
Mahmoud Assran,
Quentin Duval,
Ishan Misra,
Piotr Bojanowski,
Pascal Vincent,
Michael Rabbat,
Yann LeCun,
Nicolas Ballas
Abstract:
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target block…
▽ More
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
△ Less
Submitted 13 April, 2023; v1 submitted 19 January, 2023;
originally announced January 2023.
-
ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations
Authors:
Badr Youbi Idrissi,
Diane Bouchacourt,
Randall Balestriero,
Ivan Evtimov,
Caner Hazirbas,
Nicolas Ballas,
Pascal Vincent,
Michal Drozdzal,
David Lopez-Paz,
Mark Ibrahim
Abstract:
Deep learning vision systems are widely deployed across applications where reliability is critical. However, even today's best models can fail to recognize an object when its pose, lighting, or background varies. While existing benchmarks surface examples challenging for models, they do not explain why such mistakes arise. To address this need, we introduce ImageNet-X, a set of sixteen human annot…
▽ More
Deep learning vision systems are widely deployed across applications where reliability is critical. However, even today's best models can fail to recognize an object when its pose, lighting, or background varies. While existing benchmarks surface examples challenging for models, they do not explain why such mistakes arise. To address this need, we introduce ImageNet-X, a set of sixteen human annotations of factors such as pose, background, or lighting the entire ImageNet-1k validation set as well as a random subset of 12k training images. Equipped with ImageNet-X, we investigate 2,200 current recognition models and study the types of mistakes as a function of model's (1) architecture, e.g. transformer vs. convolutional, (2) learning paradigm, e.g. supervised vs. self-supervised, and (3) training procedures, e.g., data augmentation. Regardless of these choices, we find models have consistent failure modes across ImageNet-X categories. We also find that while data augmentation can improve robustness to certain factors, they induce spill-over effects to other factors. For example, strong random cropping hurts robustness on smaller objects. Together, these insights suggest to advance the robustness of modern vision models, future research should focus on collecting additional data and understanding data augmentation schemes. Along with these insights, we release a toolkit based on ImageNet-X to spur further study into the mistakes image recognition systems make.
△ Less
Submitted 3 November, 2022;
originally announced November 2022.
-
Disentanglement of Correlated Factors via Hausdorff Factorized Support
Authors:
Karsten Roth,
Mark Ibrahim,
Zeynep Akata,
Pascal Vincent,
Diane Bouchacourt
Abstract:
A grand goal in deep learning research is to learn representations capable of generalizing across distribution shifts. Disentanglement is one promising direction aimed at aligning a model's representation with the underlying factors generating the data (e.g. color or background). Existing disentanglement methods, however, rely on an often unrealistic assumption: that factors are statistically inde…
▽ More
A grand goal in deep learning research is to learn representations capable of generalizing across distribution shifts. Disentanglement is one promising direction aimed at aligning a model's representation with the underlying factors generating the data (e.g. color or background). Existing disentanglement methods, however, rely on an often unrealistic assumption: that factors are statistically independent. In reality, factors (like object color and shape) are correlated. To address this limitation, we consider the use of a relaxed disentanglement criterion -- the Hausdorff Factorized Support (HFS) criterion -- that encourages only pairwise factorized \emph{support}, rather than a factorial distribution, by minimizing a Hausdorff distance. This allows for arbitrary distributions of the factors over their support, including correlations between them. We show that the use of HFS consistently facilitates disentanglement and recovery of ground-truth factors across a variety of correlation settings and benchmarks, even under severe training correlations and correlation shifts, with in parts over $+60\%$ in relative improvement over existing disentanglement methods. In addition, we find that leveraging HFS for representation learning can even facilitate transfer to downstream tasks such as classification under distribution shifts. We hope our original approach and positive empirical results inspire further progress on the open problem of robust generalization. Code available at https://github.com/facebookresearch/disentangling-correlated-factors.
△ Less
Submitted 25 February, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
The Hidden Uniform Cluster Prior in Self-Supervised Learning
Authors:
Mahmoud Assran,
Randall Balestriero,
Quentin Duval,
Florian Bordes,
Ishan Misra,
Piotr Bojanowski,
Pascal Vincent,
Michael Rabbat,
Nicolas Ballas
Abstract:
A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that in the formulation of all these methods is an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-bal…
▽ More
A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that in the formulation of all these methods is an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary features priors.
△ Less
Submitted 13 October, 2022;
originally announced October 2022.
-
Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning
Authors:
Florian Bordes,
Randall Balestriero,
Quentin Garrido,
Adrien Bardes,
Pascal Vincent
Abstract:
One unexpected technique that emerged in recent years consists in training a Deep Network (DN) with a Self-Supervised Learning (SSL) method, and using this network on downstream tasks but with its last few projector layers entirely removed. This trick of throwing away the projector is actually critical for SSL methods to display competitive performances on ImageNet for which more than 30 percentag…
▽ More
One unexpected technique that emerged in recent years consists in training a Deep Network (DN) with a Self-Supervised Learning (SSL) method, and using this network on downstream tasks but with its last few projector layers entirely removed. This trick of throwing away the projector is actually critical for SSL methods to display competitive performances on ImageNet for which more than 30 percentage points can be gained that way. This is a little vexing, as one would hope that the network layer at which invariance is explicitly enforced by the SSL criterion during training (the last projector layer) should be the one to use for best generalization performance downstream. But it seems not to be, and this study sheds some light on why. This trick, which we name Guillotine Regularization (GR), is in fact a generically applicable method that has been used to improve generalization performance in transfer learning scenarios. In this work, we identify the underlying reasons behind its success and show that the optimal layer to use might change significantly depending on the training setup, the data or the downstream task. Lastly, we give some insights on how to reduce the need for a projector in SSL by aligning the pretext SSL task and the downstream task.
△ Less
Submitted 9 June, 2023; v1 submitted 27 June, 2022;
originally announced June 2022.
-
Masked Siamese Networks for Label-Efficient Learning
Authors:
Mahmoud Assran,
Mathilde Caron,
Ishan Misra,
Piotr Bojanowski,
Florian Bordes,
Pascal Vincent,
Armand Joulin,
Michael Rabbat,
Nicolas Ballas
Abstract:
We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are…
▽ More
We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance, on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark. Our code is publicly available.
△ Less
Submitted 14 April, 2022;
originally announced April 2022.
-
High Fidelity Visualization of What Your Self-Supervised Representation Knows About
Authors:
Florian Bordes,
Randall Balestriero,
Pascal Vincent
Abstract:
Discovering what is learned by neural networks remains a challenge. In self-supervised learning, classification is the most common task used to evaluate how good a representation is. However, relying only on such downstream task can limit our understanding of what information is retained in the representation of a given input. In this work, we showcase the use of a Representation Conditional Diffu…
▽ More
Discovering what is learned by neural networks remains a challenge. In self-supervised learning, classification is the most common task used to evaluate how good a representation is. However, relying only on such downstream task can limit our understanding of what information is retained in the representation of a given input. In this work, we showcase the use of a Representation Conditional Diffusion Model (RCDM) to visualize in data space the representations learned by self-supervised models. The use of RCDM is motivated by its ability to generate high-quality samples -- on par with state-of-the-art generative models -- while ensuring that the representations of those samples are faithful i.e. close to the one used for conditioning. By using RCDM to analyze self-supervised models, we are able to clearly show visually that i) SSL (backbone) representation are not invariant to the data augmentations they were trained with -- thus debunking an often restated but mistaken belief; ii) SSL post-projector embeddings appear indeed invariant to these data augmentation, along with many other data symmetries; iii) SSL representations appear more robust to small adversarial perturbation of their inputs than representations trained in a supervised manner; and iv) that SSL-trained representations exhibit an inherent structure that can be explored thanks to RCDM visualization and enables image manipulation.
△ Less
Submitted 16 August, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
Understanding Dimensional Collapse in Contrastive Self-supervised Learning
Authors:
Li Jing,
Pascal Vincent,
Yann LeCun,
Yuandong Tian
Abstract:
Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these metho…
▽ More
Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on an explicit trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.
△ Less
Submitted 23 April, 2022; v1 submitted 18 October, 2021;
originally announced October 2021.
-
Accounting for Variance in Machine Learning Benchmarks
Authors:
Xavier Bouthillier,
Pierre Delaunay,
Mirko Bronzi,
Assya Trofimov,
Brennan Nichyporuk,
Justin Szeto,
Naz Sepah,
Edward Raff,
Kanika Madan,
Vikram Voleti,
Samira Ebrahimi Kahou,
Vincent Michalski,
Dmitriy Serdyuk,
Tal Arbel,
Chris Pal,
Gaël Varoquaux,
Pascal Vincent
Abstract:
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameters choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, reve…
▽ More
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameters choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization and hyperparameter choice impact markedly the results. We analyze the predominant comparison methods used today in the light of this variance. We show a counter-intuitive result that adding more sources of variation to an imperfect estimator approaches better the ideal estimator at a 51 times reduction in compute cost. Building on these results, we study the error rate of detecting improvements, on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
△ Less
Submitted 1 March, 2021;
originally announced March 2021.
-
Online Adversarial Attacks
Authors:
Andjela Mladenovic,
Avishek Joey Bose,
Hugo Berard,
William L. Hamilton,
Simon Lacoste-Julien,
Pascal Vincent,
Gauthier Gidel
Abstract:
Adversarial attacks expose important vulnerabilities of deep learning models, yet little attention has been paid to settings where data arrives as a stream. In this paper, we formalize the online adversarial attack problem, emphasizing two key elements found in real-world use-cases: attackers must operate under partial knowledge of the target model, and the decisions made by the attacker are irrev…
▽ More
Adversarial attacks expose important vulnerabilities of deep learning models, yet little attention has been paid to settings where data arrives as a stream. In this paper, we formalize the online adversarial attack problem, emphasizing two key elements found in real-world use-cases: attackers must operate under partial knowledge of the target model, and the decisions made by the attacker are irrevocable since they operate on a transient data stream. We first rigorously analyze a deterministic variant of the online threat model by drawing parallels to the well-studied $k$-secretary problem in theoretical computer science and propose Virtual+, a simple yet practical online algorithm. Our main theoretical result shows Virtual+ yields provably the best competitive ratio over all single-threshold algorithms for $k<5$ -- extending the previous analysis of the $k$-secretary problem. We also introduce the \textit{stochastic $k$-secretary} -- effectively reducing online blackbox transfer attacks to a $k$-secretary problem under noise -- and prove theoretical bounds on the performance of Virtual+ adapted to this setting. Finally, we complement our theoretical results by conducting experiments on MNIST, CIFAR-10, and Imagenet classifiers, revealing the necessity of online algorithms in achieving near-optimal performance and also the rich interplay between attack strategies and online attack selection, enabling simple strategies like FGSM to outperform stronger adversaries.
△ Less
Submitted 22 March, 2022; v1 submitted 2 March, 2021;
originally announced March 2021.
-
Efficient Learning in Non-Stationary Linear Markov Decision Processes
Authors:
Ahmed Touati,
Pascal Vincent
Abstract:
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs), i.e, both the reward and transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose OPT-WLSVI an optimistic model-free algorithm based on weighted least squares value iteration…
▽ More
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs), i.e, both the reward and transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose OPT-WLSVI an optimistic model-free algorithm based on weighted least squares value iteration which uses exponential weights to smoothly forget data that are far in the past. We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4}H^2 Δ^{1/4} K^{3/4})$ where $d$ is the dimension of the feature space, $H$ is the planning horizon, $K$ is the number of episodes and $Δ$ is a suitable measure of non-stationarity of the MDP. Moreover, we point out technical gaps in the study of forgetting strategies in non-stationary linear bandits setting made by previous works and we propose a fix to their regret analysis.
△ Less
Submitted 27 December, 2021; v1 submitted 24 October, 2020;
originally announced October 2020.
-
WHO 2016 subtyping and automated segmentation of glioma using multi-task deep learning
Authors:
Sebastian R. van der Voort,
Fatih Incekara,
Maarten M. J. Wijnenga,
Georgios Kapsas,
Renske Gahrmann,
Joost W. Schouten,
Rishi Nandoe Tewarie,
Geert J. Lycklama,
Philip C. De Witt Hamer,
Roelant S. Eijgelaar,
Pim J. French,
Hendrikus J. Dubbink,
Arnaud J. P. E. Vincent,
Wiro J. Niessen,
Martin J. van den Bent,
Marion Smits,
Stefan Klein
Abstract:
Accurate characterization of glioma is crucial for clinical decision making. A delineation of the tumor is also desirable in the initial decision stages but is a time-consuming task. Leveraging the latest GPU capabilities, we developed a single multi-task convolutional neural network that uses the full 3D, structural, pre-operative MRI scans to can predict the IDH mutation status, the 1p/19q co-de…
▽ More
Accurate characterization of glioma is crucial for clinical decision making. A delineation of the tumor is also desirable in the initial decision stages but is a time-consuming task. Leveraging the latest GPU capabilities, we developed a single multi-task convolutional neural network that uses the full 3D, structural, pre-operative MRI scans to can predict the IDH mutation status, the 1p/19q co-deletion status, and the grade of a tumor, while simultaneously segmenting the tumor. We trained our method using the largest, most diverse patient cohort to date containing 1508 glioma patients from 16 institutes. We tested our method on an independent dataset of 240 patients from 13 different institutes, and achieved an IDH-AUC of 0.90, 1p/19q-AUC of 0.85, grade-AUC of 0.81, and a mean whole tumor DICE score of 0.84. Thus, our method non-invasively predicts multiple, clinically relevant parameters and generalizes well to the broader clinical population.
△ Less
Submitted 9 October, 2020;
originally announced October 2020.
-
Implicit Regularization via Neural Feature Alignment
Authors:
Aristide Baratin,
Thomas George,
César Laurent,
R Devon Hjelm,
Guillaume Lajoie,
Pascal Vincent,
Simon Lacoste-Julien
Abstract:
We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a regularization effect induced by a dynamical alignment of the neural tangent features introduced by Jacot et al, along a small number of task-relevant directions. This can be interpreted as a combined mechanism of feature selection and compression. By extrapolating a new analysis of Rad…
▽ More
We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a regularization effect induced by a dynamical alignment of the neural tangent features introduced by Jacot et al, along a small number of task-relevant directions. This can be interpreted as a combined mechanism of feature selection and compression. By extrapolating a new analysis of Rademacher complexity bounds for linear models, we motivate and study a heuristic complexity measure that captures this phenomenon, in terms of sequences of tangent kernel classes along optimization paths.
△ Less
Submitted 16 March, 2021; v1 submitted 3 August, 2020;
originally announced August 2020.
-
Stochastic Hamiltonian Gradient Methods for Smooth Games
Authors:
Nicolas Loizou,
Hugo Berard,
Alexia Jolicoeur-Martineau,
Pascal Vincent,
Simon Lacoste-Julien,
Ioannis Mitliagkas
Abstract:
The success of adversarial formulations in machine learning has brought renewed motivation for smooth games. In this work, we focus on the class of stochastic Hamiltonian methods and provide the first convergence guarantees for certain classes of stochastic smooth games. We propose a novel unbiased estimator for the stochastic Hamiltonian gradient descent (SHGD) and highlight its benefits. Using t…
▽ More
The success of adversarial formulations in machine learning has brought renewed motivation for smooth games. In this work, we focus on the class of stochastic Hamiltonian methods and provide the first convergence guarantees for certain classes of stochastic smooth games. We propose a novel unbiased estimator for the stochastic Hamiltonian gradient descent (SHGD) and highlight its benefits. Using tools from the optimization literature we show that SHGD converges linearly to the neighbourhood of a stationary point. To guarantee convergence to the exact solution, we analyze SHGD with a decreasing step-size and we also present the first stochastic variance reduced Hamiltonian method. Our results provide the first global non-asymptotic last-iterate convergence guarantees for the class of stochastic unconstrained bilinear games and for the more general class of stochastic games that satisfy a "sufficiently bilinear" condition, notably including some non-convex non-concave problems. We supplement our analysis with experiments on stochastic bilinear and sufficiently bilinear games, where our theory is shown to be tight, and on simple adversarial machine learning formulations.
△ Less
Submitted 8 July, 2020;
originally announced July 2020.
-
Sharp Analysis of Smoothed Bellman Error Embedding
Authors:
Ahmed Touati,
Pascal Vincent
Abstract:
The \textit{Smoothed Bellman Error Embedding} algorithm~\citep{dai2018sbeed}, known as SBEED, was proposed as a provably convergent reinforcement learning algorithm with general nonlinear function approximation. It has been successfully implemented with neural networks and achieved strong empirical results. In this work, we study the theoretical behavior of SBEED in batch-mode reinforcement learni…
▽ More
The \textit{Smoothed Bellman Error Embedding} algorithm~\citep{dai2018sbeed}, known as SBEED, was proposed as a provably convergent reinforcement learning algorithm with general nonlinear function approximation. It has been successfully implemented with neural networks and achieved strong empirical results. In this work, we study the theoretical behavior of SBEED in batch-mode reinforcement learning. We prove a near-optimal performance guarantee that depends on the representation power of the used function classes and a tight notion of the distribution shift. Our results improve upon prior guarantees for SBEED in ~\citet{dai2018sbeed} in terms of the dependence on the planning horizon and on the sample size. Our analysis builds on the recent work of ~\citet{Xie2020} which studies a related algorithm MSBO, that could be interpreted as a \textit{non-smooth} counterpart of SBEED.
△ Less
Submitted 7 July, 2020;
originally announced July 2020.
-
Adversarial Example Games
Authors:
Avishek Joey Bose,
Gauthier Gidel,
Hugo Berard,
Andre Cianflone,
Pascal Vincent,
Simon Lacoste-Julien,
William L. Hamilton
Abstract:
The existence of adversarial examples capable of fooling trained neural network classifiers calls for a much better understanding of possible attacks to guide the development of safeguards against them. This includes attack methods in the challenging non-interactive blackbox setting, where adversarial attacks are generated without any access, including queries, to the target model. Prior attacks i…
▽ More
The existence of adversarial examples capable of fooling trained neural network classifiers calls for a much better understanding of possible attacks to guide the development of safeguards against them. This includes attack methods in the challenging non-interactive blackbox setting, where adversarial attacks are generated without any access, including queries, to the target model. Prior attacks in this setting have relied mainly on algorithmic innovations derived from empirical observations (e.g., that momentum helps), lacking principled transferability guarantees. In this work, we provide a theoretical foundation for crafting transferable adversarial examples to entire hypothesis classes. We introduce Adversarial Example Games (AEG), a framework that models the crafting of adversarial examples as a min-max game between a generator of attacks and a classifier. AEG provides a new way to design adversarial examples by adversarially training a generator and a classifier from a given hypothesis class (e.g., architecture). We prove that this game has an equilibrium, and that the optimal generator is able to craft adversarial examples that can attack any classifier from the corresponding hypothesis class. We demonstrate the efficacy of AEG on the MNIST and CIFAR-10 datasets, outperforming prior state-of-the-art approaches with an average relative improvement of $29.9\%$ and $47.2\%$ against undefended and robust models (Table 2 & 3) respectively.
△ Less
Submitted 8 January, 2021; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Revisiting Loss Modelling for Unstructured Pruning
Authors:
César Laurent,
Camille Ballas,
Thomas George,
Nicolas Ballas,
Pascal Vincent
Abstract:
By removing parameters from deep neural networks, unstructured pruning methods aim at cutting down memory footprint and computational cost, while maintaining prediction accuracy. In order to tackle this otherwise intractable problem, many of these methods model the loss landscape using first or second order Taylor expansions to identify which parameters can be discarded. We revisit loss modelling…
▽ More
By removing parameters from deep neural networks, unstructured pruning methods aim at cutting down memory footprint and computational cost, while maintaining prediction accuracy. In order to tackle this otherwise intractable problem, many of these methods model the loss landscape using first or second order Taylor expansions to identify which parameters can be discarded. We revisit loss modelling for unstructured pruning: we show the importance of ensuring locality of the pruning steps. We systematically compare first and second order Taylor expansions and empirically show that both can reach similar levels of performance. Finally, we show that better preserving the original network function does not necessarily transfer to better performing networks after fine-tuning, suggesting that only considering the impact of pruning on the loss might not be a sufficient objective to design good pruning criteria.
△ Less
Submitted 22 June, 2020;
originally announced June 2020.
-
Do sequence-to-sequence VAEs learn global features of sentences?
Authors:
Tom Bosc,
Pascal Vincent
Abstract:
Autoregressive language models are powerful and relatively easy to train. However, these models are usually trained without explicit conditioning labels and do not offer easy ways to control global aspects such as sentiment or topic during generation. Bowman & al. (2016) adapted the Variational Autoencoder (VAE) for natural language with the sequence-to-sequence architecture and claimed that the l…
▽ More
Autoregressive language models are powerful and relatively easy to train. However, these models are usually trained without explicit conditioning labels and do not offer easy ways to control global aspects such as sentiment or topic during generation. Bowman & al. (2016) adapted the Variational Autoencoder (VAE) for natural language with the sequence-to-sequence architecture and claimed that the latent vector was able to capture such global features in an unsupervised manner. We question this claim. We measure which words benefit most from the latent information by decomposing the reconstruction loss per position in the sentence. Using this method, we find that VAEs are prone to memorizing the first words and the sentence length, producing local features of limited usefulness. To alleviate this, we investigate alternative architectures based on bag-of-words assumptions and language model pretraining. These variants learn latent variables that are more global, i.e., more predictive of topic or sentiment labels. Moreover, using reconstructions, we observe that they decrease memorization: the first word and the sentence length are not recovered as accurately than with the baselines, consequently yielding more diverse reconstructions.
△ Less
Submitted 28 March, 2021; v1 submitted 16 April, 2020;
originally announced April 2020.
-
Stable Policy Optimization via Off-Policy Divergence Regularization
Authors:
Ahmed Touati,
Amy Zhang,
Joelle Pineau,
Pascal Vincent
Abstract:
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we…
▽ More
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
△ Less
Submitted 19 June, 2020; v1 submitted 9 March, 2020;
originally announced March 2020.
-
Application of Genetic Algorithm for More Efficient Multi-Layer Thickness Optimization in Solar Cells
Authors:
Premkumar Vincent,
Gwenaelle Cunha Sergio,
Jaewon Jang,
In Man Kang,
Jaehoon Park,
Hyeok Kim,
Minho Lee,
Jin-Hyuk Bae
Abstract:
Thin-film solar cells are predominately designed similar to a stacked structure. Optimizing the layer thicknesses in this stack structure is crucial to extract the best efficiency of the solar cell. The commonplace method used in optimization simulations, such as for optimizing the optical spacer layers' thicknesses, is the parameter sweep. Our simulation study shows that the implementation of a m…
▽ More
Thin-film solar cells are predominately designed similar to a stacked structure. Optimizing the layer thicknesses in this stack structure is crucial to extract the best efficiency of the solar cell. The commonplace method used in optimization simulations, such as for optimizing the optical spacer layers' thicknesses, is the parameter sweep. Our simulation study shows that the implementation of a meta-heuristic method like the genetic algorithm results in a significantly faster and accurate search method when compared to the brute-force parameter sweep method in both single and multi-layer optimization. While other sweep methods can also outperform the brute-force method, they do not consistently exhibit $100\%$ accuracy in the optimized results like our genetic algorithm. We have used a well-studied P3HT-based structure to test our algorithm. Our best-case scenario was observed to use $60.84\%$ fewer simulations than the brute-force method.
△ Less
Submitted 7 April, 2020; v1 submitted 14 September, 2019;
originally announced September 2019.
-
An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation
Authors:
Vincent Michalski,
Vikram Voleti,
Samira Ebrahimi Kahou,
Anthony Ortiz,
Pascal Vincent,
Chris Pal,
Doina Precup
Abstract:
Batch normalization has been widely used to improve optimization in deep neural networks. While the uncertainty in batch statistics can act as a regularizer, using these dataset statistics specific to the training set impairs generalization in certain tasks. Recently, alternative methods for normalizing feature activations in neural networks have been proposed. Among them, group normalization has…
▽ More
Batch normalization has been widely used to improve optimization in deep neural networks. While the uncertainty in batch statistics can act as a regularizer, using these dataset statistics specific to the training set impairs generalization in certain tasks. Recently, alternative methods for normalizing feature activations in neural networks have been proposed. Among them, group normalization has been shown to yield similar, in some domains even superior performance to batch normalization. All these methods utilize a learned affine transformation after the normalization operation to increase representational power. Methods used in conditional computation define the parameters of these transformations as learnable functions of conditioning information. In this work, we study whether and where the conditional formulation of group normalization can improve generalization compared to conditional batch normalization. We evaluate performances on the tasks of visual question answering, few-shot learning, and conditional image generation.
△ Less
Submitted 31 July, 2019;
originally announced August 2019.
-
A Closer Look at the Optimization Landscapes of Generative Adversarial Networks
Authors:
Hugo Berard,
Gauthier Gidel,
Amjad Almahairi,
Pascal Vincent,
Simon Lacoste-Julien
Abstract:
Generative adversarial networks have been very successful in generative modeling, however they remain relatively challenging to train compared to standard deep neural networks. In this paper, we propose new visualization techniques for the optimization landscapes of GANs that enable us to study the game vector field resulting from the concatenation of the gradient of both players. Using these visu…
▽ More
Generative adversarial networks have been very successful in generative modeling, however they remain relatively challenging to train compared to standard deep neural networks. In this paper, we propose new visualization techniques for the optimization landscapes of GANs that enable us to study the game vector field resulting from the concatenation of the gradient of both players. Using these visualization techniques we try to bridge the gap between theory and practice by showing empirically that the training of GANs exhibits significant rotations around Local Stable Stationary Points (LSSP), similar to the one predicted by theory on toy examples. Moreover, we provide empirical evidence that GAN training converge to a stable stationary point which is a saddle point for the generator loss, not a minimum, while still achieving excellent performance.
△ Less
Submitted 27 April, 2020; v1 submitted 11 June, 2019;
originally announced June 2019.
-
Stochastic Neural Network with Kronecker Flow
Authors:
Chin-Wei Huang,
Ahmed Touati,
Pascal Vincent,
Gintare Karolina Dziugaite,
Alexandre Lacoste,
Aaron Courville
Abstract:
Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this w…
▽ More
Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and better than the baselines.
△ Less
Submitted 13 February, 2020; v1 submitted 10 June, 2019;
originally announced June 2019.
-
SVRG for Policy Evaluation with Fewer Gradient Evaluations
Authors:
Zilun Peng,
Ahmed Touati,
Pascal Vincent,
Doina Precup
Abstract:
Stochastic variance-reduced gradient (SVRG) is an optimization method originally designed for tackling machine learning problems with a finite sum structure. SVRG was later shown to work for policy evaluation, a problem in reinforcement learning in which one aims to estimate the value function of a given policy. SVRG makes use of gradient estimates at two scales. At the slower scale, SVRG computes…
▽ More
Stochastic variance-reduced gradient (SVRG) is an optimization method originally designed for tackling machine learning problems with a finite sum structure. SVRG was later shown to work for policy evaluation, a problem in reinforcement learning in which one aims to estimate the value function of a given policy. SVRG makes use of gradient estimates at two scales. At the slower scale, SVRG computes a full gradient over the whole dataset, which could lead to prohibitive computation costs. In this work, we show that two variants of SVRG for policy evaluation could significantly diminish the number of gradient calculations while preserving a linear convergence speed. More importantly, our theoretical result implies that one does not need to use the entire dataset in every epoch of SVRG when it is applied to policy evaluation with linear function approximation. Our experiments demonstrate large computational savings provided by the proposed methods.
△ Less
Submitted 19 June, 2020; v1 submitted 9 June, 2019;
originally announced June 2019.
-
Reducing Uncertainty in Undersampled MRI Reconstruction with Active Acquisition
Authors:
Zizhao Zhang,
Adriana Romero,
Matthew J. Muckley,
Pascal Vincent,
Lin Yang,
Michal Drozdzal
Abstract:
The goal of MRI reconstruction is to restore a high fidelity image from partially observed measurements. This partial view naturally induces reconstruction uncertainty that can only be reduced by acquiring additional measurements. In this paper, we present a novel method for MRI reconstruction that, at inference time, dynamically selects the measurements to take and iteratively refines the predict…
▽ More
The goal of MRI reconstruction is to restore a high fidelity image from partially observed measurements. This partial view naturally induces reconstruction uncertainty that can only be reduced by acquiring additional measurements. In this paper, we present a novel method for MRI reconstruction that, at inference time, dynamically selects the measurements to take and iteratively refines the prediction in order to best reduce the reconstruction error and, thus, its uncertainty. We validate our method on a large scale knee MRI dataset, as well as on ImageNet. Results show that (1) our system successfully outperforms active acquisition baselines; (2) our uncertainty estimates correlate with error maps; and (3) our ResNet-based architecture surpasses standard pixel-to-pixel models in the task of MRI reconstruction. The proposed method not only shows high-quality reconstructions but also paves the road towards more applicable solutions for accelerating MRI.
△ Less
Submitted 8 February, 2019;
originally announced February 2019.
-
fastMRI: An Open Dataset and Benchmarks for Accelerated MRI
Authors:
Jure Zbontar,
Florian Knoll,
Anuroop Sriram,
Tullie Murrell,
Zhengnan Huang,
Matthew J. Muckley,
Aaron Defazio,
Ruben Stern,
Patricia Johnson,
Mary Bruno,
Marc Parente,
Krzysztof J. Geras,
Joe Katsnelson,
Hersh Chandarana,
Zizhao Zhang,
Michal Drozdzal,
Adriana Romero,
Michael Rabbat,
Pascal Vincent,
Nafissa Yakubova,
James Pinkerton,
Duo Wang,
Erich Owens,
C. Lawrence Zitnick,
Michael P. Recht
, et al. (2 additional authors not shown)
Abstract:
Accelerating Magnetic Resonance Imaging (MRI) by taking fewer measurements has the potential to reduce medical costs, minimize stress to patients and make MRI possible in applications where it is currently prohibitively slow or expensive. We introduce the fastMRI dataset, a large-scale collection of both raw MR measurements and clinical MR images, that can be used for training and evaluation of ma…
▽ More
Accelerating Magnetic Resonance Imaging (MRI) by taking fewer measurements has the potential to reduce medical costs, minimize stress to patients and make MRI possible in applications where it is currently prohibitively slow or expensive. We introduce the fastMRI dataset, a large-scale collection of both raw MR measurements and clinical MR images, that can be used for training and evaluation of machine-learning approaches to MR image reconstruction. By introducing standardized evaluation criteria and a freely-accessible dataset, our goal is to help the community make rapid advances in the state of the art for MR image reconstruction. We also provide a self-contained introduction to MRI for machine learning researchers with no medical imaging background.
△ Less
Submitted 11 December, 2019; v1 submitted 21 November, 2018;
originally announced November 2018.
-
The Long Road to Computational Location Privacy: A Survey
Authors:
Primault Vincent,
Boutet Antoine,
Ben Mokhtar Sonia,
Brunie Lionel
Abstract:
The widespread adoption of continuously connected smartphones and tablets developed the usage of mobile applications, among which many use location to provide geolocated services. These services provide new prospects for users: getting directions to work in the morning, leaving a check-in at a restaurant at noon and checking next day's weather in the evening are possible right from any mobile devi…
▽ More
The widespread adoption of continuously connected smartphones and tablets developed the usage of mobile applications, among which many use location to provide geolocated services. These services provide new prospects for users: getting directions to work in the morning, leaving a check-in at a restaurant at noon and checking next day's weather in the evening are possible right from any mobile device embedding a GPS chip. In these location-based applications, the user's location is sent to a server, which uses them to provide contextual and personalised answers. However, nothing prevents the latter from gathering, analysing and possibly sharing the collected information, which opens the door to many privacy threats. Indeed, mobility data can reveal sensitive information about users, among which one's home, work place or even religious and political preferences. For this reason, many privacy-preserving mechanisms have been proposed these last years to enhance location privacy while using geolocated services. This article surveys and organises contributions in this area from classical building blocks to the most recent developments of privacy threats and location privacy-preserving mechanisms. We divide the protection mechanisms between online and offline use cases, and organise them into six categories depending on the nature of their algorithm. Moreover, this article surveys the evaluation metrics used to assess protection mechanisms in terms of privacy, utility and performance. Finally, open challenges and new directions to address the problem of computational location privacy are pointed out and discussed.
△ Less
Submitted 8 October, 2018;
originally announced October 2018.
-
Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis
Authors:
Thomas George,
César Laurent,
Xavier Bouthillier,
Nicolas Ballas,
Pascal Vincent
Abstract:
Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approxima…
▽ More
Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amendable to cheap partial updates. It consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures.
△ Less
Submitted 26 July, 2021; v1 submitted 11 June, 2018;
originally announced June 2018.
-
Randomized Value Functions via Multiplicative Normalizing Flows
Authors:
Ahmed Touati,
Harsh Satija,
Joshua Romoff,
Joelle Pineau,
Pascal Vincent
Abstract:
Randomized value functions offer a promising approach towards the challenge of efficient exploration in complex environments with high dimensional state and action spaces. Unlike traditional point estimate methods, randomized value functions maintain a posterior distribution over action-space values. This prevents the agent's behavior policy from prematurely exploiting early estimates and falling…
▽ More
Randomized value functions offer a promising approach towards the challenge of efficient exploration in complex environments with high dimensional state and action spaces. Unlike traditional point estimate methods, randomized value functions maintain a posterior distribution over action-space values. This prevents the agent's behavior policy from prematurely exploiting early estimates and falling into local optima. In this work, we leverage recent advances in variational Bayesian neural networks and combine these with traditional Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) to achieve randomized value functions for high-dimensional domains. In particular, we augment DQN and DDPG with multiplicative normalizing flows in order to track a rich approximate posterior distribution over the parameters of the value function. This allows the agent to perform approximate Thompson sampling in a computationally efficient manner via stochastic gradient methods. We demonstrate the benefits of our approach through an empirical comparison in high dimensional environments.
△ Less
Submitted 28 June, 2019; v1 submitted 6 June, 2018;
originally announced June 2018.
-
A Variational Inequality Perspective on Generative Adversarial Networks
Authors:
Gauthier Gidel,
Hugo Berard,
Gaëtan Vignoud,
Pascal Vincent,
Simon Lacoste-Julien
Abstract:
Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization probl…
▽ More
Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.
△ Less
Submitted 28 August, 2020; v1 submitted 28 February, 2018;
originally announced February 2018.
-
Improving Landmark Localization with Semi-Supervised Learning
Authors:
Sina Honari,
Pavlo Molchanov,
Stephen Tyree,
Pascal Vincent,
Christopher Pal,
Jan Kautz
Abstract:
We present two techniques to improve landmark localization in images from partially annotated datasets. Our primary goal is to leverage the common situation where precise landmark locations are only provided for a small data subset, but where class labels for classification or regression tasks related to the landmarks are more abundantly available. First, we propose the framework of sequential mul…
▽ More
We present two techniques to improve landmark localization in images from partially annotated datasets. Our primary goal is to leverage the common situation where precise landmark locations are only provided for a small data subset, but where class labels for classification or regression tasks related to the landmarks are more abundantly available. First, we propose the framework of sequential multitasking and explore it here through an architecture for landmark localization where training with class labels acts as an auxiliary signal to guide the landmark localization on unlabeled data. A key aspect of our approach is that errors can be backpropagated through a complete landmark localization model. Second, we propose and explore an unsupervised learning technique for landmark localization based on having a model predict equivariant landmarks with respect to transformations applied to the image. We show that these techniques, improve landmark prediction considerably and can learn effective detectors even when only a small fraction of the dataset has landmark labels. We present results on two toy datasets and four real datasets, with hands and faces, and report new state-of-the-art on two datasets in the wild, e.g. with only 5\% of labeled images we outperform previous state-of-the-art trained on the AFLW dataset.
△ Less
Submitted 28 October, 2018; v1 submitted 5 September, 2017;
originally announced September 2017.
-
Parametric Adversarial Divergences are Good Losses for Generative Modeling
Authors:
Gabriel Huang,
Hugo Berard,
Ahmed Touati,
Gauthier Gidel,
Pascal Vincent,
Simon Lacoste-Julien
Abstract:
Parametric adversarial divergences, which are a generalization of the losses used to train generative adversarial networks (GANs), have often been described as being approximations of their nonparametric counterparts, such as the Jensen-Shannon divergence, which can be derived under the so-called optimal discriminator assumption. In this position paper, we argue that despite being "non-optimal", p…
▽ More
Parametric adversarial divergences, which are a generalization of the losses used to train generative adversarial networks (GANs), have often been described as being approximations of their nonparametric counterparts, such as the Jensen-Shannon divergence, which can be derived under the so-called optimal discriminator assumption. In this position paper, we argue that despite being "non-optimal", parametric divergences have distinct properties from their nonparametric counterparts which can make them more suitable for learning high-dimensional distributions. A key property is that parametric divergences are only sensitive to certain aspects/moments of the distribution, which depend on the architecture of the discriminator and the loss it was trained with. In contrast, nonparametric divergences such as the Kullback-Leibler divergence are sensitive to moments ignored by the discriminator, but they do not necessarily correlate with sample quality (Theis et al., 2016). Similarly, we show that mutual information can lead to unintuitive interpretations, and explore more intuitive alternatives based on parametric divergences. We conclude that parametric divergences are a flexible framework for defining statistical quantities relevant to a specific modeling task.
△ Less
Submitted 21 October, 2021; v1 submitted 8 August, 2017;
originally announced August 2017.
-
Learning to Compute Word Embeddings On the Fly
Authors:
Dzmitry Bahdanau,
Tom Bosc,
Stanisław Jastrzębski,
Edward Grefenstette,
Pascal Vincent,
Yoshua Bengio
Abstract:
Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare. Learning representations for words in the "long tail" of this distribution requires enormous amounts of data. Representations of rare words trained directly on end tasks are usually poor, requiring us to pre-train embeddings on external data, or treat all rare words as out-of-vocabulary words…
▽ More
Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare. Learning representations for words in the "long tail" of this distribution requires enormous amounts of data. Representations of rare words trained directly on end tasks are usually poor, requiring us to pre-train embeddings on external data, or treat all rare words as out-of-vocabulary words with a unique representation. We provide a method for predicting embeddings of rare words on the fly from small amounts of auxiliary data with a network trained end-to-end for the downstream task. We show that this improves results against baselines where embeddings are trained on the end task for reading comprehension, recognizing textual entailment and language modeling.
△ Less
Submitted 7 March, 2018; v1 submitted 1 June, 2017;
originally announced June 2017.