Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations

Authors: Shashank Shekhar, Florian Bordes, Pascal Vincent, Ari Morcos

Abstract: Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of the learned representations. Our analysis reveals that reconstruction-based learning features are significantly dissimilar to joint-embedding based learning features and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network and are primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear probe transfer for classification because the different objectives drive different distributions of information and invariances in the learned representation. These differences explain opposite trends in transfer performance for downstream tasks that require spatial specificity in features. Finally, we address how fine-tuning changes reconstructive representations to enable better transfer, showing that fine-tuning re-organizes the information to be more similar to pre-trained joint embedding models.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2301.08243

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Authors: Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas

Abstract: This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

Submitted 13 April, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

Comments: 2023 IEEE/CVF International Conference on Computer Vision and Pattern Recognition

arXiv:2204.07141

Masked Siamese Networks for Label-Efficient Learning

Authors: Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas

Abstract: We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance, on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark. Our code is publicly available.

Submitted 14 April, 2022; originally announced April 2022.

arXiv:2011.11570

doi 10.1109/CDC42340.2020.9304378

Direct Transcription for Dynamic Optimization: A Tutorial with a Case Study on Dual-Patient Ventilation During the COVID-19 Pandemic

Authors: Eric C. Kerrigan, Yuanbo Nie, Omar Faqir, Caroline H. Kennedy, Steven A. Niederer, Jose A. Solis-Lemus, Peter Vincent, Steven E. Williams

Abstract: A variety of optimal control, estimation, system identification and design problems can be formulated as functional optimization problems with differential equality and inequality constraints. Since these problems are infinite-dimensional and often do not have a known analytical solution, one has to resort to numerical methods to compute an approximate solution. This paper uses a unifying notation to outline some of the techniques used in the transcription step of simultaneous direct methods (which discretize-then-optimize) for solving continuous-time dynamic optimization problems. We focus on collocation, integrated residual and Runge-Kutta schemes. These transcription methods are then applied to a simulation case study to answer a question that arose during the COVID-19 pandemic, namely: If there are not enough ventilators, is it possible to ventilate more than one patient on a single ventilator? The results suggest that it is possible, in principle, to estimate individual patient parameters sufficiently accurately, using a relatively small number of flow rate measurements, without needing to disconnect a patient from the system or needing more than one flow rate sensor. We also show that it is possible to ensure that two different patients can indeed receive their desired tidal volume, by modifying the resistance experienced by the air flow to each patient and controlling the ventilator pressure.

Submitted 23 November, 2020; originally announced November 2020.

Comments: Accepted to 59th IEEE Conference on Decision and Control, Jeju Island, Republic of Korea, December 14th-18th 2020

Journal ref: 2020 59th IEEE Conference on Decision and Control (CDC)

arXiv:2010.04425

WHO 2016 subtyping and automated segmentation of glioma using multi-task deep learning

Authors: Sebastian R. van der Voort, Fatih Incekara, Maarten M. J. Wijnenga, Georgios Kapsas, Renske Gahrmann, Joost W. Schouten, Rishi Nandoe Tewarie, Geert J. Lycklama, Philip C. De Witt Hamer, Roelant S. Eijgelaar, Pim J. French, Hendrikus J. Dubbink, Arnaud J. P. E. Vincent, Wiro J. Niessen, Martin J. van den Bent, Marion Smits, Stefan Klein

Abstract: Accurate characterization of glioma is crucial for clinical decision making. A delineation of the tumor is also desirable in the initial decision stages but is a time-consuming task. Leveraging the latest GPU capabilities, we developed a single multi-task convolutional neural network that uses the full 3D, structural, pre-operative MRI scans to predict the IDH mutation status, the 1p/19q co-deletion status, and the grade of a tumor, while simultaneously segmenting the tumor. We trained our method using the largest, most diverse patient cohort to date containing 1508 glioma patients from 16 institutes. We tested our method on an independent dataset of 240 patients from 13 different institutes, and achieved an IDH-AUC of 0.90, 1p/19q-AUC of 0.85, grade-AUC of 0.81, and a mean whole tumor DICE score of 0.84. Thus, our method non-invasively predicts multiple, clinically relevant parameters and generalizes well to the broader clinical population.

Submitted 9 October, 2020; originally announced October 2020.

arXiv:1811.08839

fastMRI: An Open Dataset and Benchmarks for Accelerated MRI

Authors: Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, Matthew J. Muckley, Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, Nafissa Yakubova, James Pinkerton, Duo Wang, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, et al. (2 additional authors not shown)

Abstract: Accelerating Magnetic Resonance Imaging (MRI) by taking fewer measurements has the potential to reduce medical costs, minimize stress to patients and make MRI possible in applications where it is currently prohibitively slow or expensive. We introduce the fastMRI dataset, a large-scale collection of both raw MR measurements and clinical MR images, that can be used for training and evaluation of machine-learning approaches to MR image reconstruction. By introducing standardized evaluation criteria and a freely-accessible dataset, our goal is to help the community make rapid advances in the state of the art for MR image reconstruction. We also provide a self-contained introduction to MRI for machine learning researchers with no medical imaging background.

Submitted 11 December, 2019; v1 submitted 21 November, 2018; originally announced November 2018.

Comments: 35 pages, 10 figures

Showing 1–6 of 6 results for author: Vincent, P