Deep Multimodal Speaker Naming

Hu, Yongtao; Ren, Jimmy; Dai, Jingwen; Yuan, Chang; Xu, Li; Wang, Wenping

doi:10.1145/2733373.2806293

arXiv:1507.04831 (cs)

[Submitted on 17 Jul 2015]

Abstract: Automatic speaker naming is the problem of localizing and identifying each speaking character in a TV/movie/live show video. The problem is challenging mainly because of its multimodal nature: face cues alone are insufficient to achieve good performance. Previous multimodal approaches usually process the data of each modality individually and merge the results using handcrafted heuristics. Such approaches work well for simple scenes but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework that automatically learns the fusion function of face and audio cues. We show that, without using face tracking, facial landmark localization, or subtitles/transcripts, our system with robust multimodal feature extraction achieves state-of-the-art speaker naming performance, evaluated on two diverse TV series. The dataset and implementation of our algorithm are publicly available online.
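
The abstract does not spell out the network architecture, so the following is only a rough illustration of what "learning the fusion function" can mean, as opposed to merging per-modality scores with hand-tuned rules. It is a minimal Python sketch, not the authors' implementation: it concatenates a face feature vector and an audio feature vector and maps the result to a distribution over candidate speakers with a single linear layer and a softmax. The function names, feature dimensions, number of speakers, and random values are all assumptions made for illustration; in a trained system the weights would be learned from data.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(face_feat, audio_feat, weights, biases):
    """Concatenate both modalities, then apply one linear layer and a
    softmax to get a distribution over candidate speakers.

    `weights` and `biases` stand in for parameters that a real system
    would learn; here they are placeholders (assumption).
    """
    fused = list(face_feat) + list(audio_feat)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical sizes, chosen only for illustration.
FACE_DIM, AUDIO_DIM, NUM_SPEAKERS = 4, 3, 5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
face_feat = [rng.uniform(-1, 1) for _ in range(FACE_DIM)]
audio_feat = [rng.uniform(-1, 1) for _ in range(AUDIO_DIM)]
weights = [
    [rng.uniform(-1, 1) for _ in range(FACE_DIM + AUDIO_DIM)]
    for _ in range(NUM_SPEAKERS)
]
biases = [0.0] * NUM_SPEAKERS

probs = fuse_and_classify(face_feat, audio_feat, weights, biases)
# Index of the most probable speaker under the placeholder weights.
predicted_speaker = max(range(NUM_SPEAKERS), key=probs.__getitem__)
```

The point of the sketch is that the combination rule lives in the weights rather than in a hand-written heuristic: each output logit can weigh face and audio dimensions differently, and those weights can be fit to labeled examples.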

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
ACM classes:	H.3
Cite as:	arXiv:1507.04831 [cs.CV]
	(or arXiv:1507.04831v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1507.04831
Related DOI:	https://doi.org/10.1145/2733373.2806293

Submission history

From: Yongtao Hu
[v1] Fri, 17 Jul 2015 04:13:12 UTC (998 KB)
