Cross-Modal Mutual Learning for Cued Speech Recognition

Liu, Lei; Liu, Li

arXiv:2212.01083 (cs)

[Submitted on 2 Dec 2022 (v1), last revised 27 Feb 2023 (this version, v2)]

Abstract: Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communication, where the Cued Speech (CS) system uses lip movements and hand gestures to code spoken language for hearing-impaired people. Previous ACSR approaches often rely on direct feature concatenation as the main fusion paradigm. However, the asynchronous modalities in CS (i.e., lip, hand shape, and hand position) may cause interference under feature concatenation. To address this challenge, we propose a transformer-based cross-modal mutual learning framework to promote multi-modal interaction. In contrast to vanilla self-attention, our model forces the modality-specific information of the different modalities to pass through a modality-invariant codebook, collating linguistic representations for the tokens of each modality. The shared linguistic knowledge is then used to re-synchronize the multi-modal sequences. Moreover, we establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese. To our knowledge, this is the first work on ACSR for Mandarin Chinese. Extensive experiments are conducted on different languages (i.e., Chinese, French, and British English). Results demonstrate that our model outperforms the state of the art by a large margin.
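
The codebook idea in the abstract can be illustrated with a minimal sketch. This is one possible reading, not the authors' architecture: the abstract does not specify layer structure, codebook size, or training objective, so the function names (attend, codebook_exchange), the two-step write/read scheme, and all dimensions below are assumptions made for illustration. The sketch shows tokens from three sequences of different lengths exchanging information only through a small shared set of vectors, rather than being concatenated frame by frame.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention: each query receives a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def codebook_exchange(modalities, codebook):
    """Hypothetical two-step exchange through a shared codebook (an assumption).

    modalities: dict mapping a modality name to a list of token vectors.
    codebook:   list of shared vectors, the same for every modality.
    """
    # Step 1 (write): codebook slots query the tokens of all modalities.
    all_tokens = [tok for tokens in modalities.values() for tok in tokens]
    shared = attend(codebook, all_tokens, all_tokens)
    # Step 2 (read): each modality's tokens query the updated codebook,
    # so cross-modal information reaches them only via the shared slots.
    return {name: attend(tokens, shared, shared) for name, tokens in modalities.items()}


def random_tokens(n, dim):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


random.seed(0)
DIM = 4  # illustrative feature size, not taken from the paper
streams = {
    "lip": random_tokens(5, DIM),            # sequences may differ in length,
    "hand_shape": random_tokens(4, DIM),     # mimicking asynchronous modalities
    "hand_position": random_tokens(3, DIM),
}
codebook = random_tokens(2, DIM)  # in a real model these would be learned
out = codebook_exchange(streams, codebook)
for name, tokens in out.items():
    print(name, len(tokens), "tokens of size", len(tokens[0]))
```

Each stream keeps its own length after the exchange, so no frame-level alignment between lip and hand sequences is required in this sketch. Whether the paper's model reads and writes the codebook this way is not stated in the abstract.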

Comments:	Accepted to ICASSP 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2212.01083 [cs.CV]
	(or arXiv:2212.01083v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.01083

Submission history

From: Lei Liu
[v1] Fri, 2 Dec 2022 10:45:33 UTC (132 KB)
[v2] Mon, 27 Feb 2023 04:30:11 UTC (3,401 KB)