Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

Baevski, Alexei; Babu, Arun; Hsu, Wei-Ning; Auli, Michael

arXiv:2212.07525 (cs)

[Submitted on 14 Dec 2022 (v1), last revised 15 Jun 2023 (this version, v2)]

Abstract: Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
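
The efficiency ideas named in the abstract (skipping masked tokens in the student encoder, and amortizing one teacher pass over several masked versions of the same sample) can be illustrated with a toy sketch. This is not the authors' implementation: every function name here is hypothetical, `toy_encode` is a stand-in for a real encoder, and the "decoder" is a simple mean of the visible encodings rather than the convolutional decoder the abstract mentions. The exponential-moving-average teacher update is assumed from the general teacher-student setup and is not stated in the abstract. The sketch also assumes the mask ratio leaves at least one token visible and at least one masked.

```python
import random

def ema_update(teacher, student, decay):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def sample_mask(num_tokens, mask_ratio, rng):
    """Return the sorted indices of the masked positions."""
    num_masked = int(round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_masked))

def toy_encode(weights, tokens):
    """Hypothetical stand-in for an encoder: scale each token by the mean weight."""
    scale = sum(weights) / len(weights)
    return [scale * x for x in tokens]

def training_step(tokens, student, teacher, num_masks, mask_ratio, rng):
    # The teacher sees the full, unmasked input once per sample ...
    targets = toy_encode(teacher, tokens)
    losses = []
    student_tokens_encoded = 0
    # ... and its targets are reused for several masked versions of that sample.
    for _ in range(num_masks):
        masked = set(sample_mask(len(tokens), mask_ratio, rng))
        visible = [i for i in range(len(tokens)) if i not in masked]
        # The student encodes only the visible tokens; masked ones are skipped.
        encoded = toy_encode(student, [tokens[i] for i in visible])
        student_tokens_encoded += len(visible)
        # Toy "decoder" (assumption): predict every masked position as the
        # mean of the visible encodings, then regress onto the teacher targets.
        guess = sum(encoded) / len(encoded)
        loss = sum((guess - targets[i]) ** 2 for i in masked) / len(masked)
        losses.append(loss)
    return sum(losses) / len(losses), student_tokens_encoded

rng = random.Random(0)
tokens = [float(i) for i in range(10)]
loss, encoded_count = training_step(tokens, [1.0, 1.0], [1.0, 1.0], 4, 0.6, rng)
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], 0.9)
```

With 10 tokens, a mask ratio of 0.6 and 4 masked versions, the teacher processes the 10 tokens once, while the student encodes 4 visible tokens per version, 16 in total.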

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2212.07525 [cs.LG]
	(or arXiv:2212.07525v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2212.07525

Submission history

From: Michael Auli
[v1] Wed, 14 Dec 2022 22:13:11 UTC (591 KB)
[v2] Thu, 15 Jun 2023 15:19:22 UTC (593 KB)
