Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

Hono, Yukiya; Mitsuda, Koh; Zhao, Tianyu; Mitsui, Kentaro; Wakatsuki, Toshiaki; Sawada, Kei

arXiv:2312.03668 (eess)

[Submitted on 6 Dec 2023 (v1), last revised 6 Jun 2024 (this version, v2)]

Abstract: Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches that use pre-trained models are gaining attention because they conserve training data and resources. However, most of their applications in ASR involve either a pre-trained speech model or a pre-trained language model, but not both. This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR. By combining the pre-trained models with a bridge network, the proposed model enables optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling. It also enables the application of remarkable developments in LLM utilization, such as parameter-efficient domain adaptation and inference optimization. Experimental results demonstrate that the proposed model achieves performance comparable to that of modern E2E ASR models by utilizing powerful pre-trained models with the proposed integrated approach.

Comments:	17 pages, 4 figures, 9 tables, accepted for Findings of ACL 2024. The model is available at this https URL
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2312.03668 [eess.AS]
	(or arXiv:2312.03668v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2312.03668

Submission history

From: Yukiya Hono
[v1] Wed, 6 Dec 2023 18:34:42 UTC (172 KB)
[v2] Thu, 6 Jun 2024 15:24:16 UTC (239 KB)
