CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation

Abbasi, Reza; Nazari, Ali; Sefid, Aminreza; Banayeeanzade, Mohammadali; Rohban, Mohammad Hossein; Baghshah, Mahdieh Soleymani

arXiv:2502.19842 (cs)

[Submitted on 27 Feb 2025 (v1), last revised 28 Feb 2025 (this version, v2)]

Abstract: Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images. For more details and access to our dataset and analysis code, visit our project repository: this https URL.
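
To make the first-mentioned-object bias concrete, the following is a minimal sketch of one way such a probe could be set up. It is not the authors' code and does not use ComCO. The names `toy_encode`, `make_caption`, and `first_position_win_rate` are hypothetical, and `toy_encode` is a hand-built stand-in that deliberately down-weights later words so the probe has something to detect. In a real experiment one would presumably pass a CLIP text encoder as `encode` instead.

```python
import math
from itertools import permutations
from typing import Callable, List, Sequence

Vector = List[float]

# Hypothetical toy vocabulary; each object gets its own embedding dimension.
VOCAB = ["cat", "dog", "car"]


def toy_encode(text: str) -> Vector:
    """Stand-in encoder (assumption): earlier object words get larger weight."""
    vec = [0.0] * len(VOCAB)
    words = [w for w in text.split() if w in VOCAB]
    for pos, word in enumerate(words):
        vec[VOCAB.index(word)] += 1.0 / (pos + 1)
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def make_caption(objects: Sequence[str]) -> str:
    """Join object names into a simple multi-object caption."""
    return " and ".join(objects)


def first_position_win_rate(
    objects: Sequence[str], encode: Callable[[str], Vector]
) -> float:
    """Fraction of orderings where the first-mentioned object's single-object
    caption is the one most similar to the multi-object caption."""
    single = {obj: encode(obj) for obj in objects}
    wins = 0
    total = 0
    for order in permutations(objects):
        caption_emb = encode(make_caption(order))
        best = max(objects, key=lambda obj: cosine(caption_emb, single[obj]))
        wins += best == order[0]
        total += 1
    return wins / total


if __name__ == "__main__":
    print(first_position_win_rate(VOCAB, toy_encode))
```

For an encoder with no order preference and no ties, the first-mentioned object would be expected to win in roughly one ordering out of `len(objects)`, so a rate well above that baseline would indicate a first-position preference.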

Comments:	Accepted at CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.19842 [cs.CV]
	(or arXiv:2502.19842v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.19842

Submission history

From: Reza Abbasi
[v1] Thu, 27 Feb 2025 07:34:42 UTC (33,923 KB)
[v2] Fri, 28 Feb 2025 19:00:13 UTC (33,923 KB)
