Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks

Liu, Chang; Zhang, Haomin; Xia, Shiyu; Chen, Zihao; Ding, Chaofan; Yue, Xin; Chen, Huizhe; Di, Xinhan

arXiv:2505.20038 (cs)

[Submitted on 26 May 2025]

Abstract: Generating high-quality piano audio from video requires precise synchronization between visual cues and musical output, ensuring accurate semantic and temporal alignment. However, existing evaluation datasets do not fully capture the intricate synchronization required for piano music generation. A comprehensive benchmark is essential for two primary reasons: (1) existing metrics fail to reflect the complexity of video-to-piano music interactions, and (2) a dedicated benchmark dataset can provide valuable insights to accelerate progress in high-quality piano music generation. To address these challenges, we introduce the CoP Benchmark Dataset, a fully open-sourced, multimodal benchmark designed specifically for video-guided piano music generation. The proposed Chain-of-Perform (CoP) benchmark offers several compelling features: (1) detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio via step-by-step Chain-of-Perform guidance; (2) a versatile evaluation framework for rigorous assessment of both general-purpose and specialized video-to-piano generation tasks; and (3) full open-sourcing of the dataset, annotations, and evaluation protocols. The dataset is publicly available at this https URL, with a continuously updated leaderboard to promote ongoing research in this domain.

Comments:	4 pages, 1 figure, accepted by CVPR 2025 MMFM Workshop
Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2505.20038 [cs.SD]
	(or arXiv:2505.20038v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2505.20038

Submission history

From: Xinhan Di
[v1] Mon, 26 May 2025 14:24:19 UTC (14,055 KB)
