Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization

Wang, Dongmei; Xiao, Xiong; Kanda, Naoyuki; Yoshioka, Takuya; Wu, Jian

arXiv:2208.13085 (eess)

[Submitted on 27 Aug 2022 (v1), last revised 26 Sep 2022 (this version, v3)]

Abstract: This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied to the speaker axis to make the model output insensitive to the order of the speaker profiles provided to the TS-VAD model. Time-wise sequential layers are interspersed between these speaker-wise transformer layers to allow the temporal and cross-speaker correlations of the input speech signal to be captured. We also extend a diarization model based on end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) by replacing its dot-product-based speaker detection layer with the transformer-based TS-VAD. Experimental results on VoxConverse show that using the transformers for the cross-speaker modeling reduces the diarization error rate (DER) of TS-VAD by 11.3%, achieving a new state-of-the-art (SOTA) DER of 4.57%. Also, our extended EEND-EDA reduces DER by 6.9% on the CALLHOME dataset relative to the original EEND-EDA with a similar model size, achieving a new SOTA DER of 11.18% under a widely used training data setting.
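The order-insensitivity claim in the abstract can be illustrated with a small sketch. The code below is not the authors' model: it is a toy, pure-Python block that applies single-head self-attention across the speaker axis (with no positional encoding) and then a causal running mean along the time axis as a stand-in for the time-wise sequential layer, whose actual type is not specified in the abstract. All names (`speaker_attention`, `time_smoothing`, `block`), the tensor sizes, and the random weights are assumptions made for illustration. The check at the end shows that permuting the speaker profiles at the input only permutes the outputs in the same way, and that nothing in the block fixes the number of speakers.

```python
import math
import random


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_attention(frame, wq, wk, wv):
    """Single-head self-attention over the speaker axis of one time frame.

    frame: list of S vectors (one per speaker profile), each of length D.
    No positional encoding is added, so no speaker slot is privileged.
    """
    d = len(frame[0])
    q = [matvec(wq, x) for x in frame]
    k = [matvec(wk, x) for x in frame]
    v = [matvec(wv, x) for x in frame]
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out


def time_smoothing(x):
    """Hypothetical stand-in for a time-wise layer: causal running mean per speaker."""
    n_frames, n_spk, dim = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * dim for _ in range(n_spk)] for _ in range(n_frames)]
    for s in range(n_spk):
        acc = [0.0] * dim
        for t in range(n_frames):
            acc = [a + b for a, b in zip(acc, x[t][s])]
            out[t][s] = [a / (t + 1) for a in acc]
    return out


def block(x, wq, wk, wv):
    """One speaker-wise attention layer followed by one time-wise layer.

    x has shape [T][S][D]; both T and S may vary between calls.
    """
    return time_smoothing([speaker_attention(frame, wq, wk, wv) for frame in x])


def permute_speakers(x, perm):
    """Reorder the speaker axis of a [T][S][D] tensor."""
    return [[frame[p] for p in perm] for frame in x]


def rand_matrix(dim):
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]


random.seed(0)
T, S, D = 5, 3, 4  # toy sizes, chosen arbitrarily
wq, wk, wv = rand_matrix(D), rand_matrix(D), rand_matrix(D)
x = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(S)] for _ in range(T)]

perm = [2, 0, 1]
y = block(x, wq, wk, wv)
y_perm = block(permute_speakers(x, perm), wq, wk, wv)

# Largest element-wise gap between "permute then run" and "run then permute".
max_diff = max(
    abs(a - b)
    for fa, fb in zip(permute_speakers(y, perm), y_perm)
    for va, vb in zip(fa, fb)
    for a, b in zip(va, vb)
)
print(max_diff < 1e-9)
```

Because the attention weights depend only on the set of speaker vectors in a frame, and the time-wise layer treats each speaker independently, the block is permutation-equivariant along the speaker axis up to floating-point rounding. A per-speaker output layer placed on top would therefore produce activity estimates that follow whatever order the profiles were supplied in.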

Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2208.13085 [eess.AS]
	(or arXiv:2208.13085v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2208.13085

Submission history

From: Dongmei Wang
[v1] Sat, 27 Aug 2022 21:11:45 UTC (254 KB)
[v2] Fri, 23 Sep 2022 17:03:27 UTC (255 KB)
[v3] Mon, 26 Sep 2022 01:30:26 UTC (254 KB)
