Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Jiang, Hao; Murdock, Calvin; Ithapu, Vamsi Krishna

arXiv:2201.01928 (cs)

[Submitted on 6 Jan 2022]

Abstract: Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear in difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and bad lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem from a new setting using both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2201.01928 [cs.CV]
	(or arXiv:2201.01928v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2201.01928

Submission history

From: Hao Jiang
[v1] Thu, 6 Jan 2022 05:40:16 UTC (3,223 KB)
