DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

Park, Hyun Joon; Kim, Jin Sob; Shin, Wooseok; Han, Sung Won

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2406.19135 (eess)

[Submitted on 27 Jun 2024]

Abstract: Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but existing approaches remain limited in obtaining well-represented styles and in model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Built on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations include the separation of styles into time-invariant and time-variant categories for effective style extraction, as well as encoders and adapters designed for high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in objective and subjective evaluations on English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, comparison results for general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.
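
The abstract names "overlapping patchify" without describing it. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below cuts a 2-D frequency-by-time array into square patches using a stride smaller than the patch size, so that neighbouring patches share entries. The function name `overlapping_patchify`, its parameters, and the toy input are all hypothetical choices made for this example.

```python
def overlapping_patchify(spec, patch, stride):
    """Split a 2-D array (frequency x time, as nested lists) into square patches.

    Hypothetical helper for illustration; not taken from the DEX-TTS code.
    With stride < patch, neighbouring patches share rows/columns (overlap).
    With stride == patch this reduces to ordinary non-overlapping patchify.
    """
    if not 0 < stride <= patch:
        raise ValueError("stride must satisfy 0 < stride <= patch")
    n_freq, n_time = len(spec), len(spec[0])
    patches = []
    # Slide a patch x patch window over both axes in steps of `stride`.
    for f0 in range(0, n_freq - patch + 1, stride):
        for t0 in range(0, n_time - patch + 1, stride):
            patches.append([row[t0:t0 + patch] for row in spec[f0:f0 + patch]])
    return patches


# Toy 4 x 6 "spectrogram" whose entries encode their own position (10*f + t).
spec = [[10 * f + t for t in range(6)] for f in range(4)]
plain = overlapping_patchify(spec, patch=2, stride=2)    # no overlap
overlap = overlapping_patchify(spec, patch=2, stride=1)  # one row/column shared
print(len(plain), len(overlap))  # prints "6 15"
```

In this toy case the overlapping variant produces 15 patches instead of 6 from the same input, which shows the trade-off such a scheme makes: more tokens for the downstream network in exchange for continuity across patch boundaries. How DEX-TTS actually embeds and recombines its patches is specified in the paper, not here.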

Comments:	Preprint
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.19135 [eess.AS]
	(or arXiv:2406.19135v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2406.19135

Submission history

From: Hyun Joon Park
[v1] Thu, 27 Jun 2024 12:39:55 UTC (4,445 KB)
