MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Zhao, Zijia; Guo, Longteng; He, Xingjian; Shao, Shuai; Yuan, Zehuan; Liu, Jing

doi:10.1145/3539618.3591721

arXiv:2210.04183 (cs)

[Submitted on 9 Oct 2022 (v1), last revised 14 Jun 2023 (this version, v3)]

Abstract: Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language but lack effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on the image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g., image pixels), thus producing semantically rich multimodal representations that perform well in both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
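
The implicit target described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoder is a stand-in, and the mask ratios (0.75 for patches, 0.25 for tokens), the zero mask embedding, the mean-squared-error loss, and the momentum value 0.995 are all assumptions chosen for illustration. All function names are hypothetical. The sketch only shows the general pattern of a student that sees jointly masked image-text input and regresses, at the masked positions, onto latent representations that a momentum (EMA) teacher computes from the unmasked input. The explicit targets (momentum visual features and word-token concepts) are not covered here.

```python
import numpy as np


def joint_mask(n_patches, n_tokens, img_ratio, txt_ratio, rng):
    """Randomly choose masked positions in both modalities (True = masked)."""
    img_mask = np.zeros(n_patches, dtype=bool)
    txt_mask = np.zeros(n_tokens, dtype=bool)
    img_mask[rng.choice(n_patches, size=int(img_ratio * n_patches), replace=False)] = True
    txt_mask[rng.choice(n_tokens, size=int(txt_ratio * n_tokens), replace=False)] = True
    return img_mask, txt_mask


def toy_encoder(x, w):
    """Stand-in for a multimodal encoder (assumption, not the real model).

    Each position is mixed with the mean over the whole image-text sequence,
    so masked positions can draw on context from both modalities.
    """
    mixed = x + x.mean(axis=0, keepdims=True)
    return np.tanh(mixed @ w)


def ema_update(teacher_w, student_w, momentum):
    """Exponential-moving-average update of the teacher weights."""
    return momentum * teacher_w + (1.0 - momentum) * student_w


def implicit_loss(student_out, teacher_out, mask):
    """Mean squared error on masked positions only (loss form is an assumption)."""
    diff = student_out[mask] - teacher_out[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
dim, n_patches, n_tokens = 8, 16, 8
patches = rng.normal(size=(n_patches, dim))   # toy image patch embeddings
tokens = rng.normal(size=(n_tokens, dim))     # toy word token embeddings
mask_embed = np.zeros(dim)                    # hypothetical [MASK] embedding

student_w = rng.normal(scale=0.1, size=(dim, dim))
teacher_w = student_w.copy()                  # teacher starts as a copy of the student

img_mask, txt_mask = joint_mask(n_patches, n_tokens, 0.75, 0.25, rng)
full = np.concatenate([patches, tokens])      # unmasked image-text sequence
mask = np.concatenate([img_mask, txt_mask])
masked = np.where(mask[:, None], mask_embed, full)

teacher_out = toy_encoder(full, teacher_w)    # latent targets from unmasked input
student_out = toy_encoder(masked, student_w)  # predictions from masked input
loss = implicit_loss(student_out, teacher_out, mask)

# In real training the student would take a gradient step here; then the teacher follows.
teacher_w = ema_update(teacher_w, student_w, momentum=0.995)
print(f"masked positions: {int(mask.sum())}, implicit loss: {loss:.4f}")
```

Because the same regression loss is applied to patch and token positions alike, the objective is shared across both modalities, which is one way to read the abstract's description of the implicit target as "unified".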

Comments:	SIGIR 2023, 10 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2210.04183 [cs.CV]
	(or arXiv:2210.04183v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.04183
Related DOI:	https://doi.org/10.1145/3539618.3591721

Submission history

From: Zijia Zhao
[v1] Sun, 9 Oct 2022 06:31:15 UTC (23,076 KB)
[v2] Mon, 5 Jun 2023 14:35:59 UTC (32,404 KB)
[v3] Wed, 14 Jun 2023 07:26:20 UTC (30,651 KB)
