Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

Wu, Haibin; Wang, Xiaofei; Eskimez, Sefik Emre; Thakker, Manthan; Tompkins, Daniel; Tsai, Chung-Hsien; Li, Canrun; Xiao, Zhen; Zhao, Sheng; Li, Jinyu; Kanda, Naoyuki

arXiv:2407.12229 (eess)

[Submitted on 17 Jul 2024 (v1), last revised 17 Sep 2024 (this version, v2)]

Abstract: People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. EmoCtrl-TTS leverages arousal and valence values, as well as laughter embeddings, to condition the flow-matching-based zero-shot TTS. To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling. Comprehensive evaluations demonstrate that EmoCtrl-TTS excels in mimicking the emotions of audio prompts in speech-to-speech translation scenarios. We also show that EmoCtrl-TTS can capture emotion changes, express strong emotions, and generate various NVs in zero-shot TTS. See this https URL for demo samples.

Comments:	Accepted by SLT2024. See this https URL for demo samples
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Cite as:	arXiv:2407.12229 [eess.AS]
	(or arXiv:2407.12229v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2407.12229

Submission history

From: Xiaofei Wang
[v1] Wed, 17 Jul 2024 00:54:15 UTC (730 KB)
[v2] Tue, 17 Sep 2024 10:40:11 UTC (1,043 KB)
