Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

Qian, Rui; Li, Yeqing; Xu, Zheng; Yang, Ming-Hsuan; Belongie, Serge; Cui, Yin

Computer Science > Computer Vision and Pattern Recognition

arXiv:2207.07646 (cs)

[Submitted on 15 Jul 2022]

Abstract: Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs, with minimal modifications, to encode video, optical flow, and audio spectrograms. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves accuracy on base classes while also generalizing better to novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
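As a rough illustration of the open-vocabulary idea described in the abstract, the sketch below scores a fused multimodal embedding against text embeddings of class names using cosine similarity. It is not the MOV implementation: the abstract does not specify the cross-modal fusion mechanism, so a plain average of normalized per-modality embeddings stands in for it here, and the function names (`l2_normalize`, `fuse_mean`, `classify`) and the toy three-dimensional embeddings are hypothetical. In a real system the embeddings would come from the pre-trained VLM's vision and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse_mean(embeddings):
    """Stand-in fusion (an assumption, not the paper's mechanism):
    average the L2-normalized per-modality embeddings, then renormalize."""
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def classify(fused, class_text_embeddings):
    """Return (best_label, scores), where scores maps each class name to the
    cosine similarity between the fused embedding and that class's text embedding."""
    scores = {}
    for label, text_emb in class_text_embeddings.items():
        t = l2_normalize(text_emb)
        scores[label] = sum(a * b for a, b in zip(fused, t))
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-written embeddings standing in for encoder outputs.
video_emb = [1.0, 0.0, 0.0]
audio_emb = [0.8, 0.2, 0.0]
class_text = {
    "playing drums": [1.0, 0.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}

best, scores = classify(fuse_mean([video_emb, audio_emb]), class_text)
print(best)
```

Because classes are represented only by their text embeddings, a novel class can be added at inference time by inserting one more entry into `class_text`, with no change to the vision side.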

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2207.07646 [cs.CV]
	(or arXiv:2207.07646v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2207.07646

Submission history

From: Yin Cui
[v1] Fri, 15 Jul 2022 17:59:11 UTC (223 KB)
