Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Dong, Sixun; Hu, Huazhang; Lian, Dongze; Luo, Weixin; Qian, Yicheng; Gao, Shenghua

arXiv:2303.12370 (cs)

[Submitted on 22 Mar 2023 (v1), last revised 28 Mar 2023 (this version, v2)]

Abstract: Sequential video understanding, an emerging video understanding task, has attracted considerable research attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We address this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation, and a pre-trained text encoder to encode the text corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple granularity loss: a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. Because the frame-sentence correspondence is not available, we exploit the fact that video actions occur sequentially in time to generate pseudo frame-sentence correspondences, and supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at this https URL
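
The two loss granularities and the order-based pseudo labels described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The function names, the temperature value, and the dynamic-programming alignment used to produce pseudo labels are all assumptions made for illustration; the abstract only states that the sequential order of actions is used to generate pseudo frame-sentence correspondence, and the paper's actual procedure may differ. Inputs are assumed to be precomputed similarity matrices (for example, cosine similarities between encoder outputs).

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def video_paragraph_loss(sim, temperature=0.07):
    """CLIP-style symmetric contrastive loss (coarse granularity).

    sim[i][j] is the similarity between video i and paragraph j in a batch;
    matching pairs are assumed to lie on the diagonal.
    The temperature is an illustrative value, not taken from the paper.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Video-to-text direction: each row should pick its own column.
    v2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-video direction: each column should pick its own row.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


def monotonic_pseudo_labels(sim):
    """Assign each frame a sentence index under a sequential-order constraint.

    sim[t][k] is the similarity between frame t and sentence k. The returned
    labels start at 0, end at K-1, and never decrease or skip a sentence.
    This dynamic program is a hypothetical stand-in for the paper's
    pseudo-label generation.
    """
    T, K = len(sim), len(sim[0])
    if T < K:
        raise ValueError("need at least one frame per sentence")
    neg = float("-inf")
    score = [[neg] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    score[0][0] = sim[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1][k]
            advance = score[t - 1][k - 1] if k > 0 else neg
            if advance > stay:
                best, back[t][k] = advance, k - 1
            else:
                best, back[t][k] = stay, k
            if best > neg:
                score[t][k] = best + sim[t][k]
    # Backtrack from the last frame, which must map to the last sentence.
    k = K - 1
    labels = [k]
    for t in range(T - 1, 0, -1):
        k = back[t][k]
        labels.append(k)
    labels.reverse()
    return labels


def frame_sentence_loss(sim, labels, temperature=0.07):
    """Fine-grained loss: cross-entropy of each frame against its pseudo label."""
    total = 0.0
    for t, row in enumerate(sim):
        total -= log_softmax([s / temperature for s in row])[labels[t]]
    return total / len(sim)


if __name__ == "__main__":
    # Toy example: 4 frames, 2 sentences; the first two frames resemble sentence 0.
    frame_sim = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
    pseudo = monotonic_pseudo_labels(frame_sim)
    print(pseudo)
    print(frame_sentence_loss(frame_sim, pseudo))
    print(video_paragraph_loss([[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch the total training objective would be some combination of `video_paragraph_loss` and `frame_sentence_loss`; how the two terms are weighted is not stated in the abstract and is left open here.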

Comments:	CVPR 2023. Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.12370 [cs.CV]
	(or arXiv:2303.12370v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.12370

Submission history

From: Sixun Dong
[v1] Wed, 22 Mar 2023 08:13:25 UTC (3,041 KB)
[v2] Tue, 28 Mar 2023 04:43:12 UTC (3,041 KB)
