Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

Alexanderson, Simon; Nagy, Rajmund; Beskow, Jonas; Henter, Gustav Eje

doi:10.1145/3592458

arXiv:2211.09707 (cs)

[Submitted on 17 Nov 2022 (v1), last revised 16 May 2023 (this version, v2)]

Abstract: Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest. See this https URL for video examples, data, and code.
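
As a rough illustration of the guidance idea mentioned in the abstract, the sketch below shows the standard classifier-free guidance rule, in which a style-conditional and a style-unconditional noise prediction are combined with a strength parameter, together with one plausible way to read the product-of-experts generalisation as a weighted sum of several models' predictions. The function names `guided_noise` and `blend_experts`, the use of plain Python lists, and the weighted-sum form are assumptions made for illustration; the formulation in the paper itself may differ.

```python
from typing import List, Sequence


def guided_noise(eps_uncond: Sequence[float],
                 eps_cond: Sequence[float],
                 gamma: float) -> List[float]:
    """Classifier-free guidance on two noise predictions.

    gamma = 0 gives the unconditional prediction, gamma = 1 the
    conditional one, and gamma > 1 exaggerates the conditioning
    (here: a more pronounced style).
    """
    return [u + gamma * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def blend_experts(predictions: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Weighted sum of several models' noise predictions.

    Assumption: a product of experts with exponents `weights`
    corresponds to a weighted sum of the experts' predictions.
    Weights (1 - a, a) on two style models would interpolate
    between the two styles.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    dim = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions))
            for i in range(dim)]


# Hypothetical toy predictions for a two-dimensional pose feature.
eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
stronger_style = guided_noise(eps_u, eps_c, gamma=2.0)
```

Under these assumptions, `guided_noise` is the special case of `blend_experts` with weights `(1 - gamma, gamma)` applied to the unconditional and conditional predictions.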

Comments:	20 pages, 9 figures. Published in ACM ToG and presented at SIGGRAPH 2023
Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
MSC classes:	68T07
ACM classes:	G.3; I.2.6; I.3.7; J.5
Cite as:	arXiv:2211.09707 [cs.LG]
	(or arXiv:2211.09707v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2211.09707
Journal reference:	ACM Trans. Graph. 42, 4 (August 2023), 20 pages
Related DOI:	https://doi.org/10.1145/3592458

Submission history

From: Gustav Eje Henter
[v1] Thu, 17 Nov 2022 17:41:00 UTC (989 KB)
[v2] Tue, 16 May 2023 17:59:58 UTC (12,708 KB)
