RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Hannan, Tanveer; Islam, Md Mohaiminul; Seidl, Thomas; Bertasius, Gedas


arXiv:2312.06729 (cs)

[Submitted on 11 Dec 2023 (v1), last revised 13 Jul 2024 (this version, v3)]


Abstract: Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for specific moment detection. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple granular levels, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long video paradigm during training. RGNet surpasses prior methods, showcasing state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
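For readers unfamiliar with the two-stage setup the abstract contrasts against, the following is a minimal toy sketch of a disjoint "retrieve a clip, then ground within it" pipeline. It is not RGNet and not the authors' code: the function names, the mean-pooled clip feature, the cosine scoring, and the threshold rule are all assumptions made purely for illustration. It may help show why a retrieval stage that sees only a coarse clip-level feature has limited fine-grained event understanding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def split_into_clips(frame_feats, clip_len):
    """Split per-frame features into consecutive, non-overlapping clips.
    Each entry is (index of the clip's first frame, list of frame features)."""
    return [(start, frame_feats[start:start + clip_len])
            for start in range(0, len(frame_feats), clip_len)]

def mean_pool(frames):
    """Average the frame features of one clip into a single clip-level feature."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def retrieve_then_ground(frame_feats, query, clip_len, threshold):
    """Hypothetical disjoint baseline: pick the best clip, then localize inside it."""
    # Stage 1 (clip retrieval): score each clip by its pooled feature only.
    clips = split_into_clips(frame_feats, clip_len)
    start, best = max(clips, key=lambda c: cosine(mean_pool(c[1]), query))
    # Stage 2 (grounding): within the retrieved clip, find frames matching the query.
    hits = [start + i for i, f in enumerate(best) if cosine(f, query) >= threshold]
    if not hits:
        return None
    # Report the span from the first to the last matching frame.
    return (hits[0], hits[-1])

# Toy data: 12 frames of 2-d features; the queried event occupies frames 6 and 7.
frames = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 2 + [[1.0, 0.0]] * 4
query = [0.0, 1.0]
moment = retrieve_then_ground(frames, query, clip_len=4, threshold=0.9)
print(moment)
```

In this sketch the two stages share nothing: the retrieval score never sees frame-level evidence, and the grounding step cannot correct a wrong retrieval. The abstract describes RGNet as addressing this by unifying both stages through shared features and mutual optimization.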

Comments:	The code is released at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.06729 [cs.CV]
	(or arXiv:2312.06729v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.06729

Submission history

From: Tanveer Hannan
[v1] Mon, 11 Dec 2023 09:12:35 UTC (17,825 KB)
[v2] Thu, 21 Mar 2024 22:11:31 UTC (27,276 KB)
[v3] Sat, 13 Jul 2024 10:21:14 UTC (23,297 KB)
