Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-streaming Teacher Guidance

Chen, Yuanzhe; Tu, Ming; Li, Tang; Li, Xin; Kong, Qiuqiang; Li, Jiaxin; Wang, Zhichao; Tian, Qiao; Wang, Yuping; Wang, Yuxuan

arXiv:2210.15158 (eess)

[Submitted on 27 Oct 2022]

Abstract: Streaming voice conversion (VC) is the task of converting the voice of one person to another in real time. Previous streaming VC methods use phonetic posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems to represent speaker-independent information. However, PPGs lack the prosody and vocalization information of the source speaker, and streaming PPGs contain undesired leaked timbre of the source speaker. In this paper, we propose to use intermediate bottleneck features (IBFs) to replace PPGs. VC systems trained with IBFs retain more prosody and vocalization information of the source speaker. Furthermore, we propose a non-streaming teacher guidance (TG) framework that addresses the timbre leakage problem. Experiments show that our proposed IBFs and the TG framework achieve a state-of-the-art streaming VC naturalness of 3.85, a content consistency of 3.77, and a timbre similarity of 3.77 under a future receptive field of 160 ms, significantly outperforming previous streaming VC systems.

Comments: The paper has been submitted to ICASSP 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as: arXiv:2210.15158 [eess.AS] (or arXiv:2210.15158v1 [eess.AS] for this version)
DOI: https://doi.org/10.48550/arXiv.2210.15158

Submission history

From: Yuanzhe Chen
[v1] Thu, 27 Oct 2022 03:53:21 UTC (441 KB)
