Showing 1–1 of 1 results for author: Koishigarina, D

  1. arXiv:2502.03566  [pdf, other]

    cs.CV cs.LG

    CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

    Authors: Darina Koishigarina, Arnas Uselis, Seong Joon Oh

    Abstract: CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular,…

    Submitted 8 February, 2025; v1 submitted 5 February, 2025; originally announced February 2025.