Semi-Supervised Contrastive Learning for Controllable Video-to-Music Retrieval

Stewart, Shanti; KV, Gouthaman; Lu, Lie; Fanelli, Andrea

arXiv:2412.05831 (cs)

[Submitted on 8 Dec 2024 (v1), last revised 23 Dec 2024 (this version, v2)]

Abstract: Content creators often use music to enhance their videos, from soundtracks in movies to background music in video blogs and social media content. However, identifying the best music for a video can be a difficult and time-consuming task. To address this challenge, we propose a novel framework for automatically retrieving a matching music clip for a given video, and vice versa. Our approach leverages annotated music labels, as well as the inherent artistic correspondence between visual and music elements. Unlike previous cross-modal music retrieval work, our method combines self-supervised and supervised training objectives: we use self-supervised and label-supervised contrastive learning to train a joint embedding space between music and video. We demonstrate the approach using music genre labels for the supervised training component; the framework can be generalized to other music annotations (e.g., emotion or instrument). Furthermore, our method enables fine-grained control over how much the retrieval process focuses on self-supervised vs. label information at inference time. We evaluate the learned embeddings through a variety of video-to-music and music-to-video retrieval tasks. Our experiments show that the proposed approach successfully combines self-supervised and supervised objectives and is effective for controllable music-video retrieval.
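
The abstract does not give the loss formulation, so the following is only a rough sketch of how a self-supervised and a label-supervised contrastive objective might be combined over one video-music similarity matrix, and how two kinds of scores could be traded off at retrieval time. Every name here (`contrastive_loss`, `combined_loss`, `rank_music`, `weight`, `alpha`, `temperature`) is hypothetical, and the weighted-sum blending is an assumption made for illustration, not the authors' method.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_sim_matrix(videos, musics):
    """Cosine similarity between every video and every music embedding."""
    v = [l2_normalize(x) for x in videos]
    m = [l2_normalize(x) for x in musics]
    return [[sum(a * b for a, b in zip(vi, mj)) for mj in m] for vi in v]


def contrastive_loss(sim, positives, temperature=0.1):
    """Softmax-based contrastive loss in the video-to-music direction.

    positives[i] is the set of music indices treated as positives for video i.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)
        # Numerically stable log-sum-exp over all candidates in the batch.
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        pos = positives[i]
        total += -sum(logits[j] - log_denom for j in pos) / len(pos)
    return total / len(sim)


def combined_loss(videos, musics, genres, weight=0.5, temperature=0.1):
    """Weighted sum of the two objectives (assumed form, for illustration)."""
    sim = cosine_sim_matrix(videos, musics)
    n = len(sim)
    # Self-supervised: the only positive is the music paired with the video.
    self_pos = [{i} for i in range(n)]
    # Label-supervised: any music clip sharing the pair's genre is a positive.
    label_pos = [{j for j in range(n) if genres[j] == genres[i]} for i in range(n)]
    l_self = contrastive_loss(sim, self_pos, temperature)
    l_label = contrastive_loss(sim, label_pos, temperature)
    return (1 - weight) * l_self + weight * l_label


def rank_music(self_scores, label_scores, alpha):
    """Rank candidates by a blend of two score lists (hypothetical control knob).

    alpha=0 uses only the self-supervised scores; alpha=1 uses only label scores.
    """
    blended = [(1 - alpha) * s + alpha * l for s, l in zip(self_scores, label_scores)]
    return sorted(range(len(blended)), key=lambda j: blended[j], reverse=True)
```

In this sketch, `weight` sets the balance between the two objectives during training, while `alpha` plays the role of an inference-time control; how the paper actually exposes that control is not stated in the abstract.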

Comments:	Accepted at ICASSP 2025
Subjects:	Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.05831 [cs.MM]
	(or arXiv:2412.05831v2 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2412.05831

Submission history

From: Shanti Stewart
[v1] Sun, 8 Dec 2024 06:37:27 UTC (1,052 KB)
[v2] Mon, 23 Dec 2024 02:52:36 UTC (1,052 KB)
