FolAI: Synchronized Foley Sound Generation with Semantic and Temporal Alignment

Gramaccioni, Riccardo Fosco; Marinoni, Christian; Postolache, Emilian; Comunità, Marco; Cosmo, Luca; Reiss, Joshua D.; Comminiello, Danilo

arXiv:2412.15023 (cs)

[Submitted on 19 Dec 2024 (v1), last revised 5 May 2025 (this version, v3)]

Abstract: Traditional sound design workflows rely on manual alignment of audio events to visual cues, as in Foley sound design, where everyday actions like footsteps or object interactions are recreated to match the on-screen motion. This process is time-consuming, difficult to scale, and lacks automation tools that preserve creative intent. Despite recent advances in vision-to-audio generation, producing temporally coherent and semantically controllable sound effects from video remains a major challenge. To address these limitations, we introduce FolAI, a two-stage generative framework that decouples the when and the what of sound synthesis, i.e., the temporal structure extraction and the semantically guided generation, respectively. In the first stage, we estimate a smooth control signal from the video that captures the motion intensity and rhythmic structure over time, serving as a temporal scaffold for the audio. In the second stage, a diffusion-based generative model produces sound effects conditioned both on this temporal envelope and on high-level semantic embeddings, provided by the user, that define the desired auditory content (e.g., material or action type). This modular design enables precise control over both timing and timbre, streamlining repetitive tasks while preserving creative flexibility in professional Foley workflows. Results on diverse visual contexts, such as footstep generation and action-specific sonorization, demonstrate that our model reliably produces audio that is temporally aligned with visual motion, semantically consistent with user intent, and perceptually realistic. These findings highlight the potential of FolAI as a controllable and modular solution for scalable, high-quality Foley sound synthesis in professional and interactive settings. Supplementary materials are accessible on our dedicated demo page.
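
To make the idea of a "smooth control signal" concrete, the following is a minimal illustrative sketch and not the authors' method: the abstract does not say how FolAI estimates its temporal envelope. The sketch assumes grayscale frames in a NumPy array and uses mean absolute frame differences followed by a moving average as a stand-in for motion intensity. The function name `motion_envelope` and the `window` parameter are hypothetical.

```python
import numpy as np


def motion_envelope(frames: np.ndarray, window: int = 5) -> np.ndarray:
    """Return a smoothed, peak-normalized motion-intensity curve.

    Assumption: `frames` has shape (T, H, W) and holds grayscale video.
    This is a hand-written proxy, not the estimator used in FolAI.
    """
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected an array of shape (T, H, W) with T >= 2")
    if window < 1:
        raise ValueError("window must be a positive integer")

    # Mean absolute difference between consecutive frames: shape (T - 1,).
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0)).mean(axis=(1, 2))

    # Pad with a leading zero so the curve has one value per frame.
    intensity = np.concatenate([[0.0], diffs])

    # Moving-average smoothing; mode="same" keeps the length at T.
    kernel = np.ones(window) / window
    smooth = np.convolve(intensity, kernel, mode="same")

    # Normalize to [0, 1]; a fully static clip stays all zeros.
    peak = smooth.max()
    return smooth / peak if peak > 0 else smooth
```

In a pipeline shaped like the one the abstract describes, a curve of this kind would play the role of the temporal scaffold passed to the second stage, while the semantic embedding would be supplied separately by the user.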

Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.15023 [cs.SD]
	(or arXiv:2412.15023v3 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2412.15023

Submission history

From: Riccardo Fosco Gramaccioni
[v1] Thu, 19 Dec 2024 16:37:19 UTC (10,176 KB)
[v2] Thu, 2 Jan 2025 16:16:08 UTC (10,176 KB)
[v3] Mon, 5 May 2025 16:55:53 UTC (10,134 KB)
