TIPS: Text-Image Pretraining with Spatial awareness

Maninis, Kevis-Kokitsi; Chen, Kaifeng; Ghosh, Soham; Karpur, Arjun; Chen, Koert; Xia, Ye; Cao, Bingyi; Salz, Daniel; Han, Guangxing; Dlabal, Jan; Gnanapragasam, Dan; Seyedhosseini, Mojtaba; Zhou, Howard; Araujo, Andre

arXiv:2410.16512 (cs)

[Submitted on 21 Oct 2024 (v1), last revised 7 Mar 2025 (this version, v2)]

Abstract: While image-text representation learning has become very popular in recent years, existing models tend to lack spatial awareness and have limited direct applicability for dense understanding tasks. For this reason, self-supervised image-only pretraining is still the go-to method for many dense vision applications (e.g. depth estimation, semantic segmentation), despite the lack of explicit supervisory signals. In this paper, we close this gap between image-text and self-supervised learning, by proposing a novel general-purpose image-text model, which can be effectively used off the shelf for dense and global vision tasks. Our method, which we refer to as Text-Image Pretraining with Spatial awareness (TIPS), leverages two simple and effective insights. First, on textual supervision: we reveal that replacing noisy web image captions by synthetically generated textual descriptions boosts dense understanding performance significantly, due to a much richer signal for learning spatially aware representations. We propose an adapted training method that combines noisy and synthetic captions, resulting in improvements across both dense and global understanding tasks. Second, on the learning technique: we propose to combine contrastive image-text learning with self-supervised masked image modeling, to encourage spatial coherence, unlocking substantial enhancements for downstream applications. Building on these two ideas, we scale our model using the transformer architecture, trained on a curated set of public images. Our experiments are conducted on 8 tasks involving 16 datasets in total, demonstrating strong off-the-shelf performance on both dense and global understanding, for several image-only and image-text tasks. Code and models are released at this https URL.
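
The two ideas in the abstract (training on both noisy and synthetic captions, and pairing contrastive image-text learning with masked image modeling) can be illustrated as a single training objective. The sketch below is not the authors' implementation. It assumes a symmetric InfoNCE contrastive loss, a plain average over the two caption sources, and a mean-squared-error reconstruction over masked patches as a simple stand-in for the masked image modeling term. All function names, the temperature of 0.07, and the `mim_weight` parameter are hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image is the positive for the i-th text."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    sims = [[dot(i, t) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy_row(sims[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sims[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


def masked_patch_loss(predicted, target, mask):
    """Mean squared error over masked patches only (assumed stand-in for MIM)."""
    errors = [
        sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)
        for pred, tgt, is_masked in zip(predicted, target, mask)
        if is_masked
    ]
    return sum(errors) / len(errors) if errors else 0.0


def combined_loss(image_embs, noisy_text_embs, synthetic_text_embs,
                  predicted_patches, target_patches, mask, mim_weight=1.0):
    # Assumption: both caption sources contribute equally to the text term.
    text_term = 0.5 * (
        contrastive_loss(image_embs, noisy_text_embs)
        + contrastive_loss(image_embs, synthetic_text_embs)
    )
    patch_term = masked_patch_loss(predicted_patches, target_patches, mask)
    return text_term + mim_weight * patch_term
```

In this sketch the contrastive terms act on one global embedding per image, while the patch term acts on per-patch features, which is one way to read the abstract's distinction between global and dense understanding. How the two caption sources and the two losses are actually weighted is not stated in the abstract.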

Comments:	ICLR2025 camera-ready + appendix
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2410.16512 [cs.CV]
	(or arXiv:2410.16512v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.16512

Submission history

From: Kaifeng Chen
[v1] Mon, 21 Oct 2024 21:05:04 UTC (5,726 KB)
[v2] Fri, 7 Mar 2025 19:38:42 UTC (10,812 KB)
