Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Lin, Chengzhi; Wu, Ancong; Liang, Junwei; Zhang, Jun; Ge, Wenhang; Zheng, Wei-Shi; Shen, Chunhua

arXiv:2209.13307 (cs)

[Submitted on 27 Sep 2022]

Abstract: Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information, while the query text describes only part of that information. Thus, a video can correspond to multiple different text descriptions and queries. We call this phenomenon the "Video-Text Correspondence Ambiguity" problem. Current techniques mostly concentrate on mining local or multi-level alignment between the contents of a video and a text (e.g., object to entity and action to verb). Because these methods describe a video with a single feature that must be matched with multiple different text features at the same time, they struggle to alleviate the video-text correspondence ambiguity. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptive aggregation of video token features. Given a query text, the similarity is determined by the prototype most similar to the text, which locates the corresponding content in the video; we term this text-adaptive matching. To learn diverse prototypes for representing the rich information in videos, we propose a variance loss to encourage different prototypes to attend to different contents of the video. Our method outperforms state-of-the-art methods on four public video retrieval datasets.
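
The matching scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the aggregation weights are computed or the exact form of the variance loss, so the linear projection `w_assign`, the softmax over tokens, the cosine similarity, and the `attention_variance` term below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def visual_prototypes(tokens, w_assign):
    """Aggregate N video token features into K prototypes.

    tokens:   (N, D) video token features.
    w_assign: (D, K) projection giving one score per token and prototype
              (an assumed mechanism; the abstract only says "adaptive
              aggregation of video token features").
    Returns prototypes of shape (K, D) and weights of shape (K, N).
    """
    # Normalise over tokens so each prototype is a convex combination of tokens.
    weights = softmax(tokens @ w_assign, axis=0).T  # (K, N), rows sum to 1
    return weights @ tokens, weights


def text_adaptive_similarity(text_feat, prototypes, eps=1e-12):
    """Score a text against a video by its most similar prototype.

    Cosine similarity is an assumption; the abstract only states that the
    similarity is determined by the most similar prototype.
    """
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    sims = p @ t  # (K,) one similarity per prototype
    return float(sims.max()), int(sims.argmax())


def attention_variance(weights):
    """Illustrative diversity measure (assumed form, not the paper's loss).

    Variance across prototypes of each token's weight, averaged over tokens.
    It is zero when all prototypes attend identically, so a training loss
    could, for example, penalise its negative.
    """
    return float(weights.var(axis=0).mean())


# Usage with random features: 12 video tokens, 8 dimensions, 4 prototypes.
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(12, 8))
projection = rng.normal(size=(8, 4))
text_feature = rng.normal(size=8)

protos, attn = visual_prototypes(video_tokens, projection)
score, best = text_adaptive_similarity(text_feature, protos)
print(f"similarity={score:.3f} via prototype {best}, "
      f"attention variance={attention_variance(attn):.4f}")
```

Taking the maximum over prototypes means two different captions of the same video can each be matched by a different prototype, which is the behaviour the abstract motivates with the correspondence ambiguity problem.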

Comments:	NIPS2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:2209.13307 [cs.CV]
	(or arXiv:2209.13307v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2209.13307
Journal reference:	NIPS2022

Submission history

From: Chengzhi Lin
[v1] Tue, 27 Sep 2022 11:13:48 UTC (2,628 KB)
