USED: Universal Speaker Extraction and Diarization

Ao, Junyi; Yıldırım, Mehmet Sinan; Ge, Meng; Wang, Shuai; Tao, Ruijie; Qian, Yanmin; Deng, Liqun; Xiao, Longshuai; Li, Haizhou

arXiv:2309.10674v1 (cs)

[Submitted on 19 Sep 2023 (this version), latest version 16 Jan 2025 (v3)]

Abstract: Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talker mixture, while speaker diarization demarcates speech segments by speaker, identifying "who spoke when". Previous studies have typically treated the two tasks independently. However, the two tasks share a similar objective: to disentangle the speakers, in the spectral domain for the former and in the temporal domain for the latter. It is therefore reasonable to expect that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker turns than the mixture speech. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend the existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines for both speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios. Audio samples are available at this https URL.
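
As a rough illustration of the "who spoke when" representation and of the distinction between highly and sparsely overlapped speech, the sketch below turns per-speaker turns into frame-level activity and measures the fraction of speech frames in which more than one speaker is active. It is not the USED model or its loss function; the names `activity_matrix` and `overlap_ratio`, the 0.5 s frame hop, and the toy speaker turns are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec), hypothetical turn format


def activity_matrix(
    turns: Dict[str, List[Segment]], duration: float, hop: float = 0.1
) -> Dict[str, List[int]]:
    """Convert speaker turns into per-frame 0/1 activity ("who spoke when")."""
    n_frames = int(round(duration / hop))
    activity: Dict[str, List[int]] = {}
    for speaker, segments in turns.items():
        frames = [0] * n_frames
        for start, end in segments:
            first = int(round(start / hop))
            last = min(n_frames, int(round(end / hop)))
            for t in range(first, last):
                frames[t] = 1
        activity[speaker] = frames
    return activity


def overlap_ratio(activity: Dict[str, List[int]]) -> float:
    """Fraction of speech frames in which two or more speakers are active."""
    counts = [sum(column) for column in zip(*activity.values())]
    speech = sum(1 for c in counts if c >= 1)
    overlapped = sum(1 for c in counts if c >= 2)
    return overlapped / speech if speech else 0.0


# Toy conversation (assumed data): two speakers, 4 seconds, one short overlap.
turns = {"spk1": [(0.0, 2.0), (3.0, 4.0)], "spk2": [(1.5, 3.0)]}
act = activity_matrix(turns, duration=4.0, hop=0.5)
ratio = overlap_ratio(act)
```

A low `ratio` corresponds to the sparsely overlapped conversations the abstract mentions, and a high one to heavily overlapped mixtures. A system could in principle treat the two cases differently, but how USED does so is described in the paper itself, not here.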

Comments:	Submitted to ICASSP 2024
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2309.10674 [cs.SD]
	(or arXiv:2309.10674v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2309.10674

Submission history

From: Junyi Ao
[v1] Tue, 19 Sep 2023 14:56:31 UTC (171 KB)
[v2] Thu, 9 May 2024 08:54:51 UTC (1,930 KB)
[v3] Thu, 16 Jan 2025 09:08:33 UTC (3,175 KB)
