Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy

Li, Bohan; Li, Zhihan; Wang, Haoran; Zhang, Hanglei; Guo, Yiwei; Wang, Hankun; Chen, Xie; Yu, Kai

arXiv:2506.22023 (cs)

[Submitted on 27 Jun 2025]

Abstract: Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models that rely on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention, leading to increased latency and degraded synthesis quality, which limits their feasibility for real-time applications. To address these limitations, we introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, designed to enhance both the efficiency and the intelligibility robustness of AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts using a lightweight module trained on-policy. DCAR dynamically adjusts the token prediction span, significantly reducing the sequence length dependency while maintaining high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, achieving up to 72.27% intelligibility improvement and 2.61x inference speedup simultaneously on the test set. Furthermore, we conduct a comprehensive analysis to support DCAR as a versatile foundation for next-generation speech synthesis systems.
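
The abstract's central idea, a decoder that emits a variable-length chunk of tokens per step instead of a single token, can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `chunkwise_decode`, `toy_predict`, `fixed_span_one`, and `toy_policy` are hypothetical names, the "model" simply reads from a fixed target sequence, and the span policy is a hand-written rule standing in for the lightweight on-policy-trained module that the abstract mentions. The sketch only shows why a larger prediction span reduces the number of sequential decoding steps.

```python
from typing import Callable, List, Sequence, Tuple

EOS = 0                                  # hypothetical end-of-sequence token id
TARGET = list(range(1, 11)) + [EOS]      # toy "speech token" sequence: 1..10, then EOS
MAX_SPAN = 4                             # assumed upper bound on tokens per step


def chunkwise_decode(
    predict_chunk: Callable[[Sequence[int]], List[int]],
    choose_span: Callable[[Sequence[int]], int],
    max_len: int = 50,
) -> Tuple[List[int], int]:
    """Decode until EOS, accepting `span` tokens per step.

    Returns the generated tokens and the number of sequential steps taken.
    """
    tokens: List[int] = []
    steps = 0
    while len(tokens) < max_len:
        # Multi-token head (assumption): proposes up to MAX_SPAN future tokens.
        candidates = predict_chunk(tokens)[:MAX_SPAN]
        if not candidates:
            break
        # Policy (assumption): decides how many proposed tokens to keep.
        span = max(1, min(choose_span(tokens), len(candidates)))
        steps += 1
        for tok in candidates[:span]:
            tokens.append(tok)
            if tok == EOS or len(tokens) >= max_len:
                return tokens, steps
    return tokens, steps


def toy_predict(context: Sequence[int]) -> List[int]:
    # Stand-in for a trained model: reads the next tokens from TARGET.
    start = len(context)
    return TARGET[start:start + MAX_SPAN]


def fixed_span_one(context: Sequence[int]) -> int:
    # Conventional next-token prediction: one token per step.
    return 1


def toy_policy(context: Sequence[int]) -> int:
    # Hand-written rule: be cautious early, then take longer chunks.
    return 1 if len(context) < 2 else 3


baseline_tokens, baseline_steps = chunkwise_decode(toy_predict, fixed_span_one)
dynamic_tokens, dynamic_steps = chunkwise_decode(toy_predict, toy_policy)
print(baseline_steps, dynamic_steps)  # 11 sequential steps vs. 5
```

In this toy setting both runs produce the same token sequence, and the chunk-wise run needs fewer sequential steps. The step counts here are properties of the toy example only and say nothing about the speedups reported in the abstract.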

Comments:	17 pages, 8 figures, 5 tables
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2506.22023 [cs.SD]
	(or arXiv:2506.22023v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2506.22023

Submission history

From: Bohan Li
[v1] Fri, 27 Jun 2025 08:45:21 UTC (705 KB)
