RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Yuan, Hangjie; Zhang, Shiwei; Wang, Xiang; Albanie, Samuel; Pan, Yining; Feng, Tao; Jiang, Jianwen; Ni, Dong; Zhang, Yingya; Zhao, Deli

arXiv:2308.09351 (cs)

[Submitted on 18 Aug 2023]

Abstract: Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, scaling RLIPv1 is challenging, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a purpose-designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data and 45.09 mAP with 100% of the data. Code and models are publicly available at this https URL.
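The abstract describes ALIF only as "gated cross-modal fusion" and does not specify the layer design. As a rough illustration of the general idea of gating a cross-attention update, the sketch below lets vision tokens attend to language tokens and scales the result by a scalar gate before adding it back. Everything here is an assumption for illustration: the `tanh` gate, the absence of learned projections, and the names `cross_attend`, `gated_fusion` and `gate_param` are hypothetical and are not taken from the RLIPv2 code. The asymmetry and the sparsified language layers mentioned in the abstract are not modelled.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Simplification (assumption): no learned query/key/value projections,
    so the language tokens act as both keys and values.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values))
             for d in range(dim)]
        )
    return out


def gated_fusion(vision, language, gate_param):
    """Residual update: vision + tanh(gate_param) * attention(vision, language).

    The tanh-of-a-scalar gate is one common gating choice, assumed here;
    it is not claimed to be the gate used in the paper.
    """
    gate = math.tanh(gate_param)
    attended = cross_attend(vision, language)
    return [
        [v + gate * a for v, a in zip(v_tok, a_tok)]
        for v_tok, a_tok in zip(vision, attended)
    ]


# Toy inputs: 2 vision tokens and 3 language tokens, each of dimension 2.
vision_tokens = [[1.0, 0.0], [0.0, 1.0]]
language_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

closed = gated_fusion(vision_tokens, language_tokens, gate_param=0.0)
opened = gated_fusion(vision_tokens, language_tokens, gate_param=1.0)
```

With `gate_param=0.0` the gate is closed and the vision tokens pass through unchanged; a non-zero gate mixes in language information. In this sketch, a gate that starts at zero means fusion can be switched on gradually during training without disturbing the vision features at initialisation.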

Comments:	Accepted to ICCV 2023. Code and models: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2308.09351 [cs.CV]
	(or arXiv:2308.09351v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.09351

Submission history

From: Hangjie Yuan
[v1] Fri, 18 Aug 2023 07:17:09 UTC (957 KB)
