Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Kim, Minchan; Jeong, Myeonghun; Choi, Byoung Jin; Ahn, Sunghwan; Lee, Joun Yeop; Kim, Nam Soo

doi:10.21437/Interspeech.2022-225

arXiv:2203.15447 (eess)

[Submitted on 29 Mar 2022 (v1), last revised 6 Oct 2022 (this version, v2)]

Abstract: Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large unlabeled speech dataset for pre-training. By leveraging wav2vec 2.0 representations, unlabeled speech can substantially improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that the single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and the ZS-TTS model fine-tuned on only 30 minutes of single-speaker data can generate the voice of an arbitrary speaker, by pre-training on an unlabeled multi-speaker speech corpus.

Comments:	Accepted by Interspeech 2022
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Cite as:	arXiv:2203.15447 [eess.AS]
	(or arXiv:2203.15447v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2203.15447
Related DOI:	https://doi.org/10.21437/Interspeech.2022-225

Submission history

From: Minchan Kim
[v1] Tue, 29 Mar 2022 11:26:56 UTC (1,117 KB)
[v2] Thu, 6 Oct 2022 07:51:53 UTC (1,117 KB)
