Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Tan, Jintao; Cheng, Xize; Xiong, Lingyu; Zhu, Lei; Li, Xiandong; Wu, Xianjia; Gong, Kai; Li, Minglei; Cai, Yi

arXiv:2408.01732 (cs)

[Submitted on 3 Aug 2024]

Abstract: Audio-driven talking head generation is a significant and challenging task applicable to various fields such as virtual avatars, film production, and online conferences. However, existing GAN-based models emphasize generating well-synchronized lip shapes but overlook the visual quality of the generated frames, while diffusion-based models prioritize generating high-quality frames but neglect lip shape matching, resulting in jittery mouth movements. To address these problems, we introduce a two-stage diffusion-based model. The first stage generates synchronized facial landmarks from the given speech. In the second stage, the generated landmarks serve as a condition in the denoising process, with the aim of reducing mouth jitter and generating high-fidelity, well-synchronized, and temporally coherent talking head videos. Extensive experiments demonstrate that our model yields the best performance.
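
The two-stage structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`audio_to_landmarks`, `toy_noise_predictor`, `sample_frame`, `generate_video`), the 68-point landmark count, the random linear projection standing in for the first stage, and the scalar conditioning are all hypothetical placeholders. The reverse step uses the standard DDPM update, which the paper may or may not use. The only point is the data flow: speech features become landmarks, and the landmarks are passed as a condition into every denoising step.

```python
import numpy as np


def audio_to_landmarks(audio_feats, n_landmarks=68, seed=0):
    """Stage 1 placeholder: map per-frame audio features to 2D landmarks.

    A fixed random linear projection stands in for a learned model (assumption).
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((audio_feats.shape[1], n_landmarks * 2)) * 0.1
    return (audio_feats @ proj).reshape(audio_feats.shape[0], n_landmarks, 2)


def toy_noise_predictor(x_t, t, landmarks):
    """Stage 2 placeholder: predict the noise in x_t given the landmark condition.

    A real model would be a trained network; here the condition only enters
    as a scalar bias (hypothetical), and t is accepted but unused.
    """
    return 0.1 * x_t + 0.01 * landmarks.mean()


def sample_frame(landmarks, shape=(8, 8), steps=50, seed=0):
    """Run a standard DDPM-style reverse process conditioned on landmarks."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = toy_noise_predictor(x, t, landmarks)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def generate_video(audio_feats, frame_shape=(8, 8)):
    """Stage 1 then stage 2: one landmark set and one sampled frame per audio frame."""
    landmarks = audio_to_landmarks(audio_feats)
    frames = [sample_frame(lm, frame_shape, seed=i) for i, lm in enumerate(landmarks)]
    return np.stack(frames)


if __name__ == "__main__":
    audio = np.ones((3, 13))  # 3 frames of 13-dim features (made-up input)
    print(generate_video(audio).shape)
```

The sketch samples each frame independently, so it does not model temporal coherence; it only shows where a landmark condition would enter the denoising loop.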

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.01732 [cs.CV]
	(or arXiv:2408.01732v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.01732

Submission history

From: Jintao Tan
[v1] Sat, 3 Aug 2024 10:19:38 UTC (7,771 KB)
