Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Min, Dongchan; Lee, Dong Bok; Yang, Eunho; Hwang, Sung Ju

arXiv:2106.03153 (eess)

[Submitted on 6 Jun 2021 (v1), last revised 16 Jun 2021 (this version, v3)]

Abstract: With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model that not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and by performing episodic training. The experimental results show that our models generate high-quality speech that accurately follows the speaker's voice from a single short-duration (1-3 sec) speech audio sample, significantly outperforming baselines.
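The idea behind SALN, as the abstract describes it, can be illustrated with a minimal sketch: normalize a hidden text feature vector, then scale and shift it with a gain and bias predicted from a style vector. The abstract does not say how the gain and bias are computed, so the linear projections below, the function names (`layer_norm`, `affine`, `saln`), the dimensions, and all numeric values are assumptions chosen for illustration, not the authors' implementation.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def affine(weights, bias, v):
    """Compute weights @ v + bias for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def saln(h, style, gain_proj, bias_proj):
    """Style-adaptive layer norm sketch: gain(style) * LN(h) + bias(style).

    gain_proj and bias_proj are (weights, bias) pairs. Using linear
    projections of the style vector is an assumption of this sketch.
    """
    y = layer_norm(h)
    gain = affine(gain_proj[0], gain_proj[1], style)
    bias = affine(bias_proj[0], bias_proj[1], style)
    return [g * y_i + b for g, y_i, b in zip(gain, y, bias)]

# Hypothetical toy inputs: a 4-dim hidden vector and two 2-dim style vectors.
hidden = [0.5, -1.0, 2.0, 0.0]
style_a = [1.0, 0.0]
style_b = [0.0, 1.0]

# Hypothetical projection parameters (in a real model these would be learned).
gain_proj = ([[1.0, 2.0] for _ in range(4)], [0.0] * 4)
bias_proj = ([[0.1, -0.1] for _ in range(4)], [0.0] * 4)

out_a = saln(hidden, style_a, gain_proj, bias_proj)
out_b = saln(hidden, style_b, gain_proj, bias_proj)
print(out_a)
print(out_b)
```

The same hidden features yield different outputs for `style_a` and `style_b`, which is the property that lets a single reference sample steer the synthesized voice without fine-tuning the model weights.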

Comments:	Accepted by ICML 2021
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2106.03153 [eess.AS]
	(or arXiv:2106.03153v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2106.03153

Submission history

From: Dongchan Min
[v1] Sun, 6 Jun 2021 15:34:11 UTC (400 KB)
[v2] Mon, 14 Jun 2021 12:50:24 UTC (549 KB)
[v3] Wed, 16 Jun 2021 16:57:10 UTC (549 KB)
