Search | arXiv e-print repository

Progressive Video Summarization via Multimodal Self-supervised Learning

Authors: Li Haopeng, Ke Qiuhong, Gong Mingming, Tom Drummond

Abstract: Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, easily leading to over-fitting of the deep models. Considering that the annotation of large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic… ▽ More Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, easily leading to over-fitting of the deep models. Considering that the annotation of large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task. Specifically, the self-supervised learning is conducted by exploring the semantic consistency between the videos and text in both coarse-grained and fine-grained fashions, as well as recovering masked frames in the videos. The multimodal framework is trained on a newly-collected dataset that consists of video-text pairs. Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. Extensive experiments have proved the effectiveness and superiority of our method in rank correlation coefficients and F-score. △ Less

Submitted 19 October, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

arXiv:2112.13478 [pdf, other]

doi 10.1109/TPAMI.2022.3186506

Video Joint Modelling Based on Hierarchical Transformer for Co-summarization

Authors: Li Haopeng, Ke Qiuhong, Gong Mingming, Zhang Rui

Abstract: Video summarization aims to automatically generate a summary (storyboard or video skim) of a video, which can facilitate large-scale video retrieval and browsing. Most of the existing methods perform video summarization on individual videos, which neglects the correlations among similar videos. Such correlations, however, are also informative for video understanding and video summarization. To add… ▽ More Video summarization aims to automatically generate a summary (storyboard or video skim) of a video, which can facilitate large-scale video retrieval and browsing. Most of the existing methods perform video summarization on individual videos, which neglects the correlations among similar videos. Such correlations, however, are also informative for video understanding and video summarization. To address this limitation, we propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization, which takes into consideration the semantic dependencies across videos. Specifically, VJMHT consists of two layers of Transformer: the first layer extracts semantic representation from individual shots of similar videos, while the second layer performs shot-level video joint modelling to aggregate cross-video semantic information. By this means, complete cross-video high-level patterns are explicitly modelled and learned for the summarization of individual videos. Moreover, Transformer-based video representation reconstruction is introduced to maximize the high-level similarity between the summary and the original video. Extensive experiments are conducted to verify the effectiveness of the proposed modules and the superiority of VJMHT in terms of F-measure and rank-based evaluation. △ Less

Submitted 29 June, 2022; v1 submitted 26 December, 2021; originally announced December 2021.

arXiv:2112.10057 [pdf, other]

Precondition and Effect Reasoning for Action Recognition

Authors: Yoo Hongsang, Li Haopeng, Ke Qiuhong, Liu Liangchen, Zhang Rui

Abstract: Human action recognition has drawn a lot of attention in the recent years due to the research and application significance. Most existing works on action recognition focus on learning effective spatial-temporal features from videos, but neglect the strong causal relationship among the precondition, action and effect. Such relationships are also crucial to the accuracy of action recognition. In thi… ▽ More Human action recognition has drawn a lot of attention in the recent years due to the research and application significance. Most existing works on action recognition focus on learning effective spatial-temporal features from videos, but neglect the strong causal relationship among the precondition, action and effect. Such relationships are also crucial to the accuracy of action recognition. In this paper, we propose to model the causal relationships based on the precondition and effect to improve the performance of action recognition. Specifically, a Cycle-Reasoning model is proposed to capture the causal relationships for action recognition. To this end, we annotate precondition and effect for a large-scale action dataset. Experimental results show that the proposed Cycle-Reasoning model can effectively reason about the precondition and effect and can enhance action recognition performance. △ Less

Submitted 13 November, 2022; v1 submitted 18 December, 2021; originally announced December 2021.

Comments: The paper is under consideration at Computer Vision and Image Understanding

Showing 1–3 of 3 results for author: Qiuhong, K