Multi-Distillation from Speech and Music Representation Models

Wei, Jui-Chiang; Lin, Yi-Cheng; Ritter-Gutierrez, Fabian; Lee, Hung-yi

arXiv:2506.07237 (eess)

[Submitted on 8 Jun 2025 (v1), last revised 11 Jun 2025 (this version, v2)]

Abstract: Real-world audio often mixes speech and music, yet models typically handle only one domain. This paper introduces a multi-teacher distillation framework that unifies speech and music models into a single one while significantly reducing model size. Our approach leverages the strengths of domain-specific teacher models, such as HuBERT for speech and MERT for music, and explores various strategies to balance both domains. Experiments across diverse tasks demonstrate that our model matches the performance of domain-specific models, showing the effectiveness of cross-domain distillation. Additionally, we conduct few-shot learning experiments, highlighting the need for general models in real-world scenarios where labeled data is limited. Our results show that our model not only performs on par with specialized models but also outperforms them in few-shot scenarios, proving that a cross-domain approach is essential and effective for diverse tasks with limited data.

Comments:	8 pages, 1 figure
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2506.07237 [eess.AS]
	(or arXiv:2506.07237v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2506.07237

Submission history

From: Jui-Chiang Wei
[v1] Sun, 8 Jun 2025 17:45:46 UTC (1,531 KB)
[v2] Wed, 11 Jun 2025 17:28:22 UTC (1,531 KB)
