Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks

Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita

arXiv:1610.01376 (cs)

[Submitted on 5 Oct 2016 (v1), last revised 10 Nov 2016 (this version, v2)]

Abstract: This paper presents a novel approach for the temporal and semantic segmentation of edited videos into meaningful segments, from the point of view of the storytelling structure. The objective is to decompose a long video into more manageable sequences, which can in turn be used to retrieve its most significant parts given a textual query and to provide an effective summarization. Previous video decomposition methods mainly employed perceptual cues, tackling the problem either as story change detection or as a similarity grouping task, and their lack of semantics limited their ability to identify story boundaries. Our proposal combines perceptual, audio and semantic cues in a specialized deep network architecture, designed with a combination of CNNs, which generates an appropriate embedding; shots are then clustered into connected sequences of semantic scenes, i.e. stories. A retrieval presentation strategy is also proposed, which selects the semantically and aesthetically "most valuable" thumbnails to present, taking the query into account in order to improve the storytelling presentation. Finally, the subjective nature of the task is considered, by conducting experiments with different annotators and by proposing an algorithm to maximize the agreement between automatic results and human annotators.
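
To make the idea of grouping shots into connected sequences concrete, the following is a minimal sketch and not the authors' method: it assumes each shot already has an embedding vector and simply starts a new scene wherever two adjacent shots are farther apart than a threshold. The function names `euclidean` and `segment_shots`, the `threshold` parameter, and the toy 2-D embeddings are all hypothetical and chosen only for illustration; the paper's learned multimodal embedding and its clustering procedure are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment_shots(
    embeddings: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Group consecutive shots into temporally contiguous scenes.

    Returns inclusive (first_shot, last_shot) index pairs. A new scene starts
    whenever the distance between adjacent shot embeddings exceeds `threshold`.
    This is an illustrative heuristic, not the approach described in the paper.
    """
    if not embeddings:
        return []
    scenes: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(embeddings)):
        if euclidean(embeddings[i - 1], embeddings[i]) > threshold:
            scenes.append((start, i - 1))  # close the current scene
            start = i                      # the next scene begins at shot i
    scenes.append((start, len(embeddings) - 1))
    return scenes


# Toy 2-D embeddings (hypothetical values standing in for learned features).
shots = [(0.0, 0.1), (0.1, 0.0), (2.0, 2.1), (2.1, 2.0), (2.0, 1.9), (5.0, 0.0)]
print(segment_shots(shots, threshold=1.0))  # [(0, 1), (2, 4), (5, 5)]
```

Because scenes are built only from adjacent shots, every resulting segment is contiguous in time, which mirrors the "connected sequences" constraint mentioned in the abstract; a fixed threshold is the simplest possible boundary criterion and would likely need to be replaced by a learned or data-driven one in practice.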

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1610.01376 [cs.CV]
	(or arXiv:1610.01376v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1610.01376

Submission history

From: Lorenzo Baraldi
[v1] Wed, 5 Oct 2016 11:55:33 UTC (1,575 KB)
[v2] Thu, 10 Nov 2016 14:09:38 UTC (1,640 KB)
