Bridge the Modality and Capability Gaps in Vision-Language Model Selection

Yi, Chao; He, Yu-Hang; Zhan, De-Chuan; Ye, Han-Jia

arXiv:2403.13797 (cs)

[Submitted on 20 Mar 2024 (v1), last revised 18 May 2025 (this version, v3)]

Abstract: Vision-Language Models (VLMs) excel at zero-shot image classification by pairing images with textual category names. The expanding variety of pre-trained VLMs increases the likelihood of finding a suitable VLM for a specific task. To better reuse VLM resources and fully leverage their potential across zero-shot image classification tasks, a promising strategy is to select appropriate pre-trained VLMs from the VLM Zoo using only the text data of the target dataset, without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM selection setting: the "Modality Gap", the disparity between a VLM's embeddings across the two modalities, which makes text a less reliable substitute for images; and the "Capability Gap", the discrepancy between a VLM's overall ranking and its ranking on the target dataset, which hinders direct prediction of a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first uses optimal transport to capture the relevance between open-source datasets and the target dataset in a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from the open-source datasets to the target dataset, bridging both gaps. By bridging the two gaps to obtain better substitutes for test images, SWAB can accurately predict the performance ranking of different VLMs on the target task without needing the dataset's images. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.
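
The abstract's two-step idea (an optimal-transport matrix relating open-source classes to target classes, then a transfer of per-class statistics through that matrix) can be illustrated with a rough sketch. This is not the authors' implementation: the cosine cost, the entropic Sinkhorn solver, the uniform class weights, the column normalisation, and every function name below are assumptions made for illustration, and the embeddings and statistics are random placeholders rather than real VLM outputs.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic-regularised optimal transport between two uniform distributions.

    Returns a transport plan of shape (n_source, n_target).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # assumed: uniform mass on open-source classes
    b = np.full(m, 1.0 / m)  # assumed: uniform mass on target classes
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def transfer_statistics(plan, source_stats):
    """Give each target class a convex combination of source-class statistics."""
    weights = plan / plan.sum(axis=0, keepdims=True)  # each column sums to 1
    return weights.T @ source_stats


# Placeholder data: 6 open-source classes, 4 target classes, 16-dim text embeddings.
rng = np.random.default_rng(0)
source_text = rng.normal(size=(6, 16))
target_text = rng.normal(size=(4, 16))
# Hypothetical per-class statistic for 3 candidate VLMs (e.g. class-wise accuracy).
source_stats = rng.uniform(size=(6, 3))

plan = sinkhorn(cosine_cost(source_text, target_text))
target_stats = transfer_statistics(plan, source_stats)

# Rank the candidate VLMs by their mean transferred statistic (best first).
predicted_ranking = np.argsort(-target_stats.mean(axis=0))
```

In this sketch, target classes whose text embeddings lie close to an open-source class inherit more of that class's statistics, which is one plausible reading of "transferring useful statistics" through the transportation matrix.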

Comments:	fix typo in figure 2 "Capability Gap"
Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.13797 [cs.LG]
	(or arXiv:2403.13797v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2403.13797

Submission history

From: Chao Yi
[v1] Wed, 20 Mar 2024 17:54:58 UTC (4,789 KB)
[v2] Sat, 2 Nov 2024 03:14:39 UTC (5,538 KB)
[v3] Sun, 18 May 2025 16:13:47 UTC (5,589 KB)
