Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

Lewis, Martha; Yu, Qinan; Merullo, Jack; Pavlick, Ellie

arXiv:2212.10537v1 (cs)

[Submitted on 20 Dec 2022 (this version), latest version 30 Aug 2024 (v3)]

Abstract: Large-scale models combining text and images have made incredible progress in recent years. However, they can still fail at tasks requiring compositional knowledge, such as correctly picking out a red cube from a picture of multiple shapes. We examine the ability of CLIP (Radford et al., 2021) to caption images requiring compositional knowledge. We implement five compositional language models to probe the kinds of structure that CLIP may be using, and develop a novel training algorithm, Compositional Skipgram for Images (CoSI), to train these models. We look at performance in attribute-based tasks, which require identifying a particular combination of attribute and object (such as "red cube"), and in relational settings, where the spatial relation between two shapes (such as "cube behind sphere") must be identified. We find that under some conditions, CLIP is able to learn attribute-object labellings and to generalize to unseen attribute-object combinations. However, we also see evidence that CLIP is not able to bind features together reliably. Moreover, CLIP is not able to reliably learn relations between objects, whereas some compositional models are able to learn these perfectly. Of the five models we developed, none were able to generalize to unseen relations.
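To make the binding problem concrete, here is a minimal toy sketch in Python. It does not use CLIP, CoSI, or any of the five models from the paper. It assumes a purely additive ("bag of concepts") embedding in which a phrase or scene is the sum of one-hot word vectors, and ranks candidate captions by cosine similarity. All names (`embed_phrase`, `embed_scene`, `rank_captions`) and the two-attribute, two-object vocabulary are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

# Hypothetical toy vocabulary; not taken from the paper's dataset.
ATTRIBUTES = ["red", "blue"]
OBJECTS = ["cube", "sphere"]
VOCAB = ATTRIBUTES + OBJECTS


def one_hot(word):
    """Return a one-hot vector for a word in the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def add(u, v):
    """Element-wise vector sum."""
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed_phrase(attr, obj):
    """Assumed additive composition: attribute vector + object vector."""
    return add(one_hot(attr), one_hot(obj))


def embed_scene(pairs):
    """Assumed scene embedding: sum of the embeddings of every object shown."""
    total = [0.0] * len(VOCAB)
    for attr, obj in pairs:
        total = add(total, embed_phrase(attr, obj))
    return total


def rank_captions(scene_pairs):
    """Score every attribute-object caption against the scene embedding."""
    scene = embed_scene(scene_pairs)
    return {
        f"{attr} {obj}": cosine(scene, embed_phrase(attr, obj))
        for attr, obj in product(ATTRIBUTES, OBJECTS)
    }


single = rank_captions([("red", "cube")])
double = rank_captions([("red", "cube"), ("blue", "sphere")])
```

Under these assumptions, a scene containing only a red cube gives "red cube" the highest score. A scene containing a red cube and a blue sphere gives all four captions the same score, so "red sphere" and "blue cube" cannot be told apart from the correct captions. This is the kind of failure that a representation without attribute-object binding would show.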

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2212.10537 [cs.CV]
	(or arXiv:2212.10537v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.10537

Submission history

From: Martha Lewis
[v1] Tue, 20 Dec 2022 18:46:28 UTC (945 KB)
[v2] Wed, 29 Mar 2023 15:34:23 UTC (306 KB)
[v3] Fri, 30 Aug 2024 04:51:28 UTC (318 KB)
