Information Sieve: Content Leakage Reduction in End-to-End Prosody For Expressive Speech Synthesis

Dai, Xudong; Gong, Cheng; Wang, Longbiao; Zhang, Kaili

arXiv:2108.01831 (cs)

[Submitted on 4 Aug 2021]

Abstract: Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding that represents style information. However, this embedding process may also encode redundant textual information, a phenomenon called content leakage. Researchers have attempted to resolve this problem by adding an ASR loss or other auxiliary supervision loss functions. In this study, we propose an unsupervised method called the "information sieve" to reduce the effect of content leakage in prosody transfer. The rationale of this approach is that a well-designed downsample-upsample filter can force the style encoder to focus on style information rather than on the textual information contained in the reference speech; i.e., the extracted style embeddings are downsampled at a certain interval and then upsampled by duplication. Furthermore, we use instance normalization in the convolution layers to help the system learn a better latent style space. Objective metrics, such as a significantly lower word error rate (WER), demonstrate the effectiveness of this model in mitigating content leakage. Listening tests indicate that the model retains its prosody transferability compared with baseline models such as the original GST-Tacotron and ASR-guided Tacotron.
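
The downsample-upsample filter and the instance normalization mentioned in the abstract can be illustrated with a minimal sketch. The function names `information_sieve` and `instance_norm`, the `interval` parameter, and the list-of-frames representation are assumptions made for illustration; this is not the authors' implementation, and the abstract does not state the interval or where in the encoder the filter is applied.

```python
from typing import List, Sequence

Frame = List[float]  # one style-embedding vector per time step (assumed layout)


def information_sieve(frames: Sequence[Frame], interval: int) -> List[Frame]:
    """Keep every `interval`-th frame, then repeat each kept frame
    `interval` times so the output has the same length as the input."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    kept = frames[::interval]  # downsample at a fixed interval
    # Upsample by duplication: each kept frame stands in for its neighbours.
    upsampled = [list(f) for f in kept for _ in range(interval)]
    return upsampled[: len(frames)]  # trim any overshoot


def instance_norm(frames: Sequence[Frame], eps: float = 1e-5) -> List[Frame]:
    """Normalize each channel to zero mean and unit variance over the time
    axis of a single utterance (no learned scale or shift in this sketch)."""
    n, channels = len(frames), len(frames[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        column = [f[c] for f in frames]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        scale = (var + eps) ** -0.5
        for t in range(n):
            out[t][c] = (column[t] - mean) * scale
    return out


if __name__ == "__main__":
    embeddings = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    print(information_sieve(embeddings, interval=2))
    # prints [[1.0], [1.0], [3.0], [3.0], [5.0]]
```

In this sketch, fine-grained frame-to-frame variation is discarded by the sieve while slower variation survives; the abstract's claim is that a suitable filter of this kind pushes the encoder toward style rather than textual content.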

Comments:	Accepted by Interspeech 2021
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2108.01831 [cs.SD]
	(or arXiv:2108.01831v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2108.01831

Submission history

From: Xudong Dai
[v1] Wed, 4 Aug 2021 03:45:16 UTC (985 KB)
