Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

Pan, Zexu; Zhao, Shengkui; Wang, Tingting; Zhou, Kun; Ma, Yukun; Zhang, Chong; Ma, Bin

arXiv:2505.20635 (eess)

[Submitted on 27 May 2025]

Abstract: Audio-visual speaker extraction isolates a target speaker's speech from a speech mixture, conditioned on a visual cue, typically a recording of the target speaker's face. However, in real-world scenarios, other co-occurring faces are often present on-screen, and they provide valuable speaker activity cues for the scene. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of these co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and the sparsely overlapped MISP, demonstrate that our approach consistently outperforms the baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
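As a rough illustration of what attending over a flexible number of co-occurring faces could look like, the sketch below applies plain scaled dot-product attention across speakers at a single time frame. It is not the authors' module: the function name `inter_speaker_attention`, the absence of learned projections, and the residual fusion are all assumptions made for illustration, and the abstract does not specify any of these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inter_speaker_attention(target, others):
    """Hypothetical single-frame attention across speakers.

    target: embedding of the target face at one frame (list of D floats).
    others: embeddings of co-occurring faces (list of K lists of D floats);
            K may be 0 and may differ from frame to frame.
    Returns (fused, weights): the target embedding plus an attention-weighted
    mix of all faces, and the K + 1 attention weights (target first).
    """
    faces = [target] + list(others)   # the target attends to itself too
    scale = math.sqrt(len(target))    # standard 1/sqrt(D) scaling
    scores = [
        sum(q * k for q, k in zip(target, face)) / scale for face in faces
    ]
    weights = softmax(scores)
    context = [
        sum(w * face[d] for w, face in zip(weights, faces))
        for d in range(len(target))
    ]
    # Assumed residual connection, so the module can be dropped into an
    # existing visual stream without changing its shape.
    fused = [t + c for t, c in zip(target, context)]
    return fused, weights


# The number of co-occurring faces can change per frame.
frames = [
    ([1.0, 0.0], []),                        # target alone on screen
    ([1.0, 0.0], [[0.0, 1.0]]),              # one co-occurring face
    ([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]]),  # two co-occurring faces
]
for target, others in frames:
    fused, weights = inter_speaker_attention(target, others)
    print(len(others), [round(w, 3) for w in weights])
```

Because the softmax runs over however many faces are present, the output keeps the target embedding's dimensionality regardless of the face count, which is the property that makes such a block easy to insert into an existing extraction model.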

Comments:	Interspeech 2025
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Cite as:	arXiv:2505.20635 [eess.AS]
	(or arXiv:2505.20635v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2505.20635

Submission history

From: Zexu Pan
[v1] Tue, 27 May 2025 02:21:38 UTC (333 KB)
