UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

Wang, Jinting; Yang, Shan; Li, Chenxing; Yu, Dong; Liu, Li

arXiv:2506.04134 (cs)

[Submitted on 4 Jun 2025 (v1), last revised 5 Aug 2025 (this version, v3)]

Abstract: Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. In contrast, directly generating audio speech from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly generates speech from CS videos without relying on intermediate text. The core innovation of UniCUE lies in integrating an understanding task (CSR) that provides fine-grained CS visual-semantic cues to guide speech generation. Specifically, UniCUE incorporates a pose-aware visual processor, a semantic alignment pool that enables precise visual-semantic mapping, and a VisioPhonetic adapter to bridge the understanding and generation tasks within a unified architecture. To support this framework, we construct UniCUE-HI, a large-scale Mandarin CS dataset containing 11,282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals. Extensive experiments on this dataset demonstrate that UniCUE achieves state-of-the-art performance across multiple evaluation metrics.

Comments:	8 pages, 5 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2506.04134 [cs.CV]
	(or arXiv:2506.04134v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.04134

Submission history

From: Jinting Wang
[v1] Wed, 4 Jun 2025 16:26:49 UTC (3,372 KB)
[v2] Wed, 23 Jul 2025 07:56:36 UTC (2,512 KB)
[v3] Tue, 5 Aug 2025 08:57:37 UTC (712 KB)
