YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Chen, Zihao; Zhang, Haomin; Di, Xinhan; Wang, Haoyu; Shan, Sizhe; Zheng, Junjie; Liang, Yunming; Fan, Yihan; Zhu, Xinfa; Tian, Wenjie; Wang, Yihua; Ding, Chaofan; Xie, Lei

arXiv:2412.09168 (cs)

[Submitted on 12 Dec 2024]

Abstract: Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, an industry-standard video-to-audio (V2A) dataset that encompasses various real-world scenarios is presented. We show that YingSound effectively generates high-quality synchronized sounds across diverse conditional inputs through automated evaluations and human studies. Project Page: this https URL
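
The abstract names conditional flow matching but does not give its formulation. As a rough illustration only, the sketch below shows the generic conditional flow matching objective with a straight-line path from noise to data, which is one common choice and not necessarily the one YingSound uses. The function names, the `cond` argument (standing in for visual features), and the `predict_velocity` callable are all hypothetical and do not come from the paper.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Build one training pair on a straight-line path (an assumed path choice).

    x0: noise sample, x1: data sample (e.g. an audio latent), t: time in [0, 1].
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def cfm_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t, cond) is a hypothetical model callable; cond is a
    placeholder for whatever conditioning (e.g. video features) the model takes.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t, target = cfm_training_pair(x0, x1, t)
    pred = predict_velocity(x_t, t, cond)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x1)
```

In this sketch a real model would replace `predict_velocity` with a transformer that receives the conditioning features, and the loss would be averaged over a batch before each optimizer step.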

Comments:	16 pages, 4 figures
Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.09168 [cs.SD]
	(or arXiv:2412.09168v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2412.09168

Submission history

From: Xinhan Di
[v1] Thu, 12 Dec 2024 10:55:57 UTC (23,470 KB)