Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Yaras, Can; Chen, Siyi; Wang, Peng; Qu, Qing

arXiv:2412.07909 (cs)

[Submitted on 10 Dec 2024]

Abstract: Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive Language-Image Pretraining (CLIP) are designed to bridge different modalities, such as images and text, by learning a shared representation space through contrastive learning. Despite their success, the working mechanisms underlying multimodal learning are not yet well understood. Notably, these models often exhibit a modality gap, where different modalities occupy distinct regions within the shared representation space. In this work, we conduct an in-depth analysis of the emergence of the modality gap by characterizing the gradient flow learning dynamics. Specifically, we identify the critical roles of mismatched data pairs and a learnable temperature parameter in causing and perpetuating the modality gap during training. Furthermore, our theoretical insights are validated through experiments on practical CLIP models. These findings provide principled guidance for mitigating the modality gap, including strategies such as appropriate temperature scheduling and modality swapping. Additionally, we demonstrate that closing the modality gap leads to improved performance on tasks such as image-text retrieval.
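As a rough illustration of the quantities the abstract refers to, the NumPy sketch below computes a symmetric CLIP-style contrastive loss with a temperature, together with one simple proxy for the modality gap: the distance between the centroids of the normalized image and text embeddings. The function names and the centroid-distance proxy are assumptions made for illustration; they are not the authors' definitions or code.

```python
import numpy as np

def l2_normalize(x):
    # Project each row onto the unit sphere, as CLIP-style models do.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature):
    # Row i of img and row i of txt form a matched pair;
    # every other combination is treated as a mismatched pair.
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def centroid_gap(img, txt):
    # Assumed proxy for the modality gap: distance between the
    # mean normalized embedding of each modality.
    return np.linalg.norm(l2_normalize(img).mean(axis=0)
                          - l2_normalize(txt).mean(axis=0))
```

With perfectly aligned pairs, `centroid_gap` is zero and the loss approaches zero as the temperature shrinks. With a very large temperature the logits flatten and the loss approaches the log of the batch size, which gives a sense of why the temperature interacts with how strongly pairs are pulled together.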

Comments:	The first two authors contributed equally to this work
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.07909 [cs.LG]
	(or arXiv:2412.07909v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2412.07909

Submission history

From: Can Yaras
[v1] Tue, 10 Dec 2024 20:36:49 UTC (3,535 KB)
