The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval

Meinardus, Boris; Batra, Anil; Rohrbach, Anna; Rohrbach, Marcus

Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.18113v3 (cs)

[Submitted on 26 Jun 2024 (v1), revised 14 Oct 2024 (this version, v3), latest version 11 Mar 2025 (v5)]

Abstract: Recent studies have shown promising results in utilizing multimodal large language models (MLLMs) for computer vision tasks such as object detection and semantic segmentation. However, many challenging video tasks remain under-explored. Video-language tasks necessitate spatial and temporal comprehension and require significant compute. Therefore, prior works have developed complex, highly specialized architectures or leveraged additional input signals such as video transcripts to best encode contextual and temporal information, which limits their generality and can be impractical. One particularly challenging task is video moment retrieval, which requires precise temporal and contextual grounding. This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval. We introduce Mr. BLIP (Mr. as in Moment Retrieval), a multimodal, single-stage model that requires no expensive video-language pretraining, no additional input signal (e.g., no transcript or audio), and has a simpler and more versatile design than prior state-of-the-art methods. We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions. Notably, we attain over 9% (absolute) higher Recall (at 0.5 and 0.7 IoU) on the challenging long-video multi-moment QVHighlights benchmark. Our code is publicly available.
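
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how Recall at a temporal IoU threshold is commonly computed for moment retrieval. It is not taken from the authors' code. The function names `temporal_iou` and `recall_at_iou` are hypothetical, and the sketch assumes one top-ranked predicted window and one ground-truth window per query; benchmark evaluation scripts (for example for multi-moment queries in QVHighlights) may differ.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) windows, in seconds."""
    # Overlap is clamped at zero for disjoint windows.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(top1_preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: exactly one ground-truth window per query.
    """
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Illustrative (made-up) windows, not results from the paper.
preds = [(2.0, 8.0), (0.0, 5.0)]
gts = [(3.0, 10.0), (4.0, 10.0)]
r_at_05 = recall_at_iou(preds, gts, 0.5)
r_at_07 = recall_at_iou(preds, gts, 0.7)
```

In this toy example the first prediction has IoU 5/8 with its ground truth and the second has IoU 1/10, so one of two queries counts as a hit at threshold 0.5 and none at 0.7.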

Comments:	Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.18113 [cs.CV]
	(or arXiv:2406.18113v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.18113

Submission history

From: Boris Meinardus
[v1] Wed, 26 Jun 2024 06:59:09 UTC (6,672 KB)
[v2] Wed, 24 Jul 2024 06:43:07 UTC (6,671 KB)
[v3] Mon, 14 Oct 2024 06:50:19 UTC (6,671 KB)
[v4] Fri, 21 Feb 2025 00:49:07 UTC (15,466 KB)
[v5] Tue, 11 Mar 2025 10:03:46 UTC (15,466 KB)
