Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

Fan, Yingruo; Lin, Zhaojiang; Saito, Jun; Wang, Wenping; Komura, Taku

arXiv:2112.02214 (cs)

[Submitted on 4 Dec 2021 (v1), last revised 7 Dec 2021 (this version, v2)]

Abstract: Speech-driven 3D facial animation with accurate lip synchronization has been widely studied. However, synthesizing realistic motions for the entire face during speech has rarely been explored. In this work, we present a joint audio-text model to capture the contextual information for expressive speech-driven 3D facial animation. Existing datasets are collected to cover as many different phonemes as possible rather than sentences, which limits the capability of an audio-based model to learn more diverse contexts. To address this, we propose to leverage contextual text embeddings extracted from a powerful pre-trained language model that has learned rich contextual representations from large-scale text data. Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio. In contrast to prior approaches, which learn phoneme-level features from the text, we investigate high-level contextual text features for speech-driven 3D facial animation. We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization. We conduct quantitative and qualitative evaluations as well as a perceptual user study. The results demonstrate the superior performance of our model over existing state-of-the-art approaches.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Cite as:	arXiv:2112.02214 [cs.CV]
	(or arXiv:2112.02214v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2112.02214

Submission history

From: Evelyn Fan
[v1] Sat, 4 Dec 2021 01:37:22 UTC (1,681 KB)
[v2] Tue, 7 Dec 2021 12:58:30 UTC (1,681 KB)
