Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities

Saon, George; Dekel, Avihu; Brooks, Alexander; Nagano, Tohru; Daniels, Abraham; Satt, Aharon; Mittal, Ashish; Kingsbury, Brian; Haws, David; Morais, Edmilson; Kurata, Gakuto; Aronowitz, Hagai; Ibrahim, Ibrahim; Kuo, Jeff; Soule, Kate; Lastras, Luis; Suzuki, Masayuki; Hoory, Ron; Thomas, Samuel; Novitasari, Sashi; Fukuda, Takashi; Sunder, Vishal; Cui, Xiaodong; Kons, Zvi

Abstract: Granite-speech LLMs are compact and efficient speech language models specifically designed for English ASR and automatic speech translation (AST). The models were trained by modality-aligning the 2B and 8B parameter variants of granite-3.3-instruct to speech on publicly available open-source corpora containing audio inputs and text targets consisting of either human transcripts for ASR or automatically generated translations for AST. Comprehensive benchmarking shows that on English ASR, which was our primary focus, they outperform several competitors' models that were trained on orders of magnitude more proprietary data, and they keep pace on English-to-X AST for major European languages, Japanese, and Chinese. The speech-specific components are: a conformer acoustic encoder using block attention and self-conditioning trained with connectionist temporal classification, a windowed query-transformer speech modality adapter that temporally downsamples the acoustic embeddings and maps them to the LLM text embedding space, and LoRA adapters to further fine-tune the text LLM. Granite-speech-3.3 operates in two modes: in speech mode, it performs ASR and AST by activating the encoder, projector, and LoRA adapters; in text mode, it calls the underlying granite-3.3-instruct model directly (without LoRA), essentially preserving all the text LLM capabilities and safety. Both models are freely available on HuggingFace and can be used for both research and commercial purposes under a permissive Apache 2.0 license.
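
The temporal downsampling performed by a windowed query-based adapter can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single attention head with no learned projections, and the function names (attend, windowed_downsample, output_length), the window size of 15 frames, and the 3 queries per window are all hypothetical values chosen for illustration. The point it shows is only that a fixed set of queries applied to each non-overlapping window yields ceil(T / window) * num_queries output vectors for T input frames.

```python
import math


def attend(queries, frames):
    """Single-head dot-product cross-attention (simplified, no projections).

    Each query produces a softmax-weighted average of the frames in the window.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in frames]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * f[i] for w, f in zip(weights, frames)) / total for i in range(dim)]
        )
    return outputs


def windowed_downsample(frames, queries, window):
    """Apply the same queries to every non-overlapping window of frames."""
    outputs = []
    for start in range(0, len(frames), window):
        outputs.extend(attend(queries, frames[start:start + window]))
    return outputs


def output_length(num_frames, window, num_queries):
    """Number of vectors emitted for num_frames input frames."""
    return math.ceil(num_frames / window) * num_queries


# Illustrative values only (assumptions, not taken from the abstract):
# 30 two-dimensional frames, windows of 15 frames, 3 queries per window.
frames = [[float(t), 1.0] for t in range(30)]
queries = [[0.0, 0.0] for _ in range(3)]  # zero queries give uniform attention
demo = windowed_downsample(frames, queries, window=15)
print(len(demo))  # 6
```

With all-zero queries the attention weights are uniform, so each output is simply the mean of its window; learned queries would instead let each output focus on different frames within the window.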

Comments:	7 pages, 9 figures
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2505.08699 [eess.AS]
	(or arXiv:2505.08699v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2505.08699

