SpeechVerse: A Large-scale Generalizable Audio Language Model

Das, Nilaksh; Dingliwal, Saket; Ronanki, Srikanth; Paturi, Rohit; Huang, Zhaocheng; Mathur, Prashant; Yuan, Jie; Bekal, Dhanush; Niu, Xing; Jayanthi, Sai Muralidhar; Li, Xilai; Mundnich, Karel; Sunkara, Monica; Bodapati, Sravan; Srinivasan, Sundararajan; Han, Kyu J; Kirchhoff, Katrin

arXiv:2405.08295 (cs)

[Submitted on 14 May 2024 (v1), last revised 24 Mar 2025 (this version, v3)]

Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.
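
The abstract describes frozen speech and text foundation models joined by a small set of learnable parameters. The following is a minimal, dependency-free sketch of that general pattern, not the authors' implementation. The abstract does not say what form the learnable parameters take, so the pooling-plus-projection adapter, all dimensions, the stride, and every name below (`encode_speech`, `adapt`, `build_llm_input`, `params`) are assumptions made for illustration.

```python
import random

random.seed(0)

def make_matrix(rows, cols):
    """Random weight matrix as a list of rows (a stand-in for real weights)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

# Hypothetical sizes, chosen only for illustration.
FRAME_DIM, SPEECH_DIM, LLM_DIM, STRIDE = 8, 16, 32, 4

# Only the adapter is marked trainable; both foundation models stay frozen.
params = {
    "speech_encoder": {"weights": make_matrix(SPEECH_DIM, FRAME_DIM), "trainable": False},
    "adapter": {"weights": make_matrix(LLM_DIM, SPEECH_DIM), "trainable": True},
    "llm": {"weights": make_matrix(LLM_DIM, LLM_DIM), "trainable": False},
}

def encode_speech(frames):
    """Frozen speech encoder: one continuous feature vector per input frame."""
    return [matvec(params["speech_encoder"]["weights"], f) for f in frames]

def adapt(features):
    """Assumed adapter: average every STRIDE frames, then project to LLM width."""
    pooled = []
    for start in range(0, len(features), STRIDE):
        chunk = features[start:start + STRIDE]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return [matvec(params["adapter"]["weights"], p) for p in pooled]

def build_llm_input(frames, prompt_embeddings):
    """Prepend adapted audio embeddings to the embedded text instruction."""
    return adapt(encode_speech(frames)) + prompt_embeddings

def count(name):
    """Number of scalar weights in the named component."""
    w = params[name]["weights"]
    return len(w) * len(w[0])

frames = [[random.random() for _ in range(FRAME_DIM)] for _ in range(10)]
prompt = [[0.0] * LLM_DIM for _ in range(3)]  # placeholder instruction embeddings
llm_input = build_llm_input(frames, prompt)

trainable = sum(count(n) for n, p in params.items() if p["trainable"])
total = sum(count(n) for n in params)
print(f"LLM input length: {len(llm_input)}, trainable weights: {trainable}/{total}")
```

In this toy setup, a training loop would update only the entries flagged `trainable`, which is the sense in which the pre-trained models are kept frozen. The frozen `llm` weights are included only so the trainable fraction can be counted; the sketch stops at building the LLM input sequence.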

Comments:	Single Column, 13 pages
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2405.08295 [cs.CL]
	(or arXiv:2405.08295v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.08295

Submission history

From: Saket Dingliwal
[v1] Tue, 14 May 2024 03:33:31 UTC (1,398 KB)
[v2] Fri, 31 May 2024 17:47:40 UTC (1,398 KB)
[v3] Mon, 24 Mar 2025 21:06:53 UTC (1,398 KB)
