InterBERT: An Effective Multi-Modal Pretraining Approach via Vision-and-Language Interaction

Lin, Junyang; Yang, An; Zhang, Yichang; Liu, Jie; Zhou, Jingren; Yang, Hongxia

arXiv:2003.13198v2 (cs)

[Submitted on 30 Mar 2020 (v1), revised 10 Jun 2020 (this version, v2), latest version 22 Apr 2021 (v4)]

Abstract: We propose a novel method for multi-modal pretraining, namely InterBERT (BERT for Interaction). The proposed architecture has a strong capability for modeling the interaction between the information flows of different modalities. The single-stream interaction module effectively processes information from multiple modalities, and the two-stream extraction module on top preserves the independence of each modality to avoid a significant performance drop in single-modal tasks. The proposed pretraining task, masked group modeling (MGM), comprises masked segment modeling and masked region modeling. It encourages the model to predict a span or region instead of a single word or object, which requires the model to learn from the general context. We pretrain the model with MGM and the conventional image-text matching task, and finetune it on a series of vision-and-language downstream tasks, including caption-based image retrieval, zero-shot image retrieval, and visual commonsense reasoning. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods. The analysis shows that the proposed MGM is effective for pretraining, and that our method for multi-modal pretraining can adapt to single-modal tasks without a significant performance decrease in comparison with the BERT-base model.
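
The idea behind masked segment modeling, hiding a contiguous span of words rather than isolated tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_segments`, the 15% masking ratio, the maximum span length of 3, and the `[MASK]` placeholder are all assumptions chosen for illustration, since the abstract does not specify them. The sketch covers only the text side; masked region modeling is not shown.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, as in BERT-style masking


def mask_segments(tokens, mask_ratio=0.15, max_span=3, seed=None):
    """Hide contiguous spans of tokens until about mask_ratio of them are masked.

    Returns (masked_tokens, labels), where labels[i] holds the original token
    at every masked position and None elsewhere.
    All default values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, round(n * mask_ratio)) if n else 0
    masked = list(tokens)
    labels = [None] * n
    covered = 0
    attempts = 0
    # Bound the number of attempts so the loop always terminates.
    while covered < budget and attempts < 10 * n:
        attempts += 1
        # Never mask more than the remaining budget.
        span = min(rng.randint(1, max_span), budget - covered)
        start = rng.randrange(0, n - span + 1)
        # Skip spans that overlap an already masked position.
        if any(labels[i] is not None for i in range(start, start + span)):
            continue
        for i in range(start, start + span):
            labels[i] = tokens[i]
            masked[i] = MASK
        covered += span
    return masked, labels


if __name__ == "__main__":
    caption = "a man rides a brown horse on the beach at sunset".split()
    masked, labels = mask_segments(caption, seed=0)
    print(masked)
    print(labels)
```

A model trained on such inputs must recover every token in a masked span from the surrounding words (and, in the multi-modal setting, from the image), which is the sense in which the task pushes it to rely on the general context.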

Comments:	15 pages
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2003.13198 [cs.CL]
	(or arXiv:2003.13198v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2003.13198

Submission history

From: An Yang
[v1] Mon, 30 Mar 2020 03:13:22 UTC (9,024 KB)
[v2] Wed, 10 Jun 2020 02:14:00 UTC (1,155 KB)
[v3] Wed, 6 Jan 2021 16:19:34 UTC (16,292 KB)
[v4] Thu, 22 Apr 2021 11:20:26 UTC (16,283 KB)
