Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

Kanda, Naoyuki; Wu, Jian; Wu, Yu; Xiao, Xiong; Meng, Zhong; Wang, Xiaofei; Gaur, Yashesh; Chen, Zhuo; Li, Jinyu; Yoshioka, Takuya

arXiv:2203.16685 (eess)

[Submitted on 30 Mar 2022 (v1), last revised 14 Jul 2022 (this version, v2)]

Abstract: This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency, even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder-based speaker embedding extractor that can estimate a speaker representation for each recognized token, not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription at low latency. We evaluate the proposed model on a joint task of ASR and SID/SD using the LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
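
To make the idea of a per-token speaker representation concrete, the following is a minimal sketch of how token-level embeddings could be turned into a "who spoke what" transcript for the SID case. It is not the paper's implementation: the cosine-similarity scoring, the function names (`cosine`, `attribute_tokens`, `group_by_speaker`), the two-dimensional toy embeddings, and the profile names `spk_A` and `spk_B` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute_tokens(tokens, embeddings, profiles):
    """Label each token with the enrolled speaker whose profile is closest.

    Assumption: one embedding per recognized token, and one profile vector
    per enrolled speaker. Cosine scoring is a stand-in, not the paper's method.
    """
    labeled = []
    for token, emb in zip(tokens, embeddings):
        speaker = max(profiles, key=lambda name: cosine(emb, profiles[name]))
        labeled.append((speaker, token))
    return labeled


def group_by_speaker(labeled):
    """Collect each speaker's tokens, preserving their order in the stream."""
    per_speaker = {}
    for speaker, token in labeled:
        per_speaker.setdefault(speaker, []).append(token)
    return {spk: " ".join(toks) for spk, toks in per_speaker.items()}


# Toy data (hypothetical): two overlapping utterances interleaved token by token.
tokens = ["hello", "good", "there", "morning"]
embeddings = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.3], [0.1, 0.9]]
profiles = {"spk_A": [1.0, 0.0], "spk_B": [0.0, 1.0]}

labeled = attribute_tokens(tokens, embeddings, profiles)
transcript = group_by_speaker(labeled)
print(transcript)
```

Because every token carries its own label, interleaved tokens from overlapping speech can be separated without waiting for an utterance to end, which is the property that makes a token-synchronous embedding attractive for streaming use. For the SD case, where no enrolled profiles exist, the same per-token vectors would instead have to be clustered.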

Comments:	Accepted for presentation at Interspeech 2022
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2203.16685 [eess.AS]
	(or arXiv:2203.16685v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2203.16685

Submission history

From: Naoyuki Kanda
[v1] Wed, 30 Mar 2022 21:42:00 UTC (350 KB)
[v2] Thu, 14 Jul 2022 20:38:18 UTC (402 KB)