Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

Meng, Zhong; Gaur, Yashesh; Kanda, Naoyuki; Li, Jinyu; Chen, Xie; Wu, Yu; Gong, Yifan

arXiv:2110.05354 (cs)

[Submitted on 6 Oct 2021 (v1), last revised 26 Jun 2022 (this version, v5)]

Abstract: Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing the computation cost. To overcome this, we propose internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the token sequence probability, which is approximated by the E2E model output after zeroing out the encoder contribution. During ILMA, we fine-tune the internal LM, i.e., the E2E components excluding the encoder, to minimize a cross-entropy loss. To make ILMA effective, it is essential to train the E2E model with an internal LM loss in addition to the standard E2E loss. Furthermore, we propose to regularize ILMA by minimizing the Kullback-Leibler divergence between the output distributions of the adapted and unadapted internal LMs. ILMA is most effective when we update only the last linear layer of the joint network. ILMA enables fast text-only adaptation of the E2E model without increasing the run-time computational cost. In experiments with transformer transducer models trained on 30K hours, ILMA achieves up to 34.9% relative word error rate reduction from the unadapted baseline.
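The adaptation objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy joint network (a `tanh` over summed encoder and prediction-network projections, followed by one linear layer), the direction of the KL term, the `kl_weight` interpolation factor, and all function names are assumptions made for illustration only. The sketch computes internal LM logits by zeroing out the encoder contribution, then combines a cross-entropy loss on a text-only target token with a KL penalty against the unadapted internal LM. Only the last linear layer differs between the adapted and unadapted models, mirroring the setting the abstract reports as most effective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def internal_lm_logits(pred_hidden, out_weight, out_bias):
    """Toy joint network with the encoder contribution set to zero.

    Assumed form: logits = W * tanh(0 + pred_hidden) + b, where W and b
    are the last linear layer of the joint network.
    """
    joint_hidden = [math.tanh(h) for h in pred_hidden]
    return [
        sum(w * h for w, h in zip(row, joint_hidden)) + b
        for row, b in zip(out_weight, out_bias)
    ]


def cross_entropy(probs, target):
    """Negative log-probability of the target token index."""
    return -math.log(probs[target])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ilma_loss(pred_hidden, target, adapted, unadapted, kl_weight):
    """Cross-entropy on text-only data plus a KL regularizer.

    `adapted` and `unadapted` are (out_weight, out_bias) pairs.
    Assumption: the penalty is KL(unadapted || adapted), scaled by kl_weight.
    """
    p_adapted = softmax(internal_lm_logits(pred_hidden, *adapted))
    p_unadapted = softmax(internal_lm_logits(pred_hidden, *unadapted))
    ce = cross_entropy(p_adapted, target)
    return ce + kl_weight * kl_divergence(p_unadapted, p_adapted)


# Hypothetical toy values: 2-dim prediction-network output, 3-token vocabulary.
pred_hidden = [0.5, -1.0]
unadapted = ([[0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]], [0.0, 0.0, 0.0])
adapted = ([[0.6, 0.1], [-0.3, 0.2], [0.0, 0.1]], [0.1, 0.0, -0.1])

loss = ilma_loss(pred_hidden, target=0, adapted=adapted,
                 unadapted=unadapted, kl_weight=0.5)
print(loss)
```

In a real adaptation run, the gradient of this loss would be taken with respect to the adapted parameters only, while the encoder and the unadapted copy stay fixed. Because the encoder is untouched and no external LM is added, inference cost is unchanged, which is the property the abstract emphasizes.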

Comments:	5 pages, in Interspeech 2022
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2110.05354 [cs.CL]
	(or arXiv:2110.05354v5 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2110.05354
Journal reference:	Interspeech 2022, Incheon, Korea

Submission history

From: Zhong Meng
[v1] Wed, 6 Oct 2021 23:03:29 UTC (32 KB)
[v2] Thu, 14 Oct 2021 23:14:32 UTC (32 KB)
[v3] Fri, 18 Feb 2022 19:08:15 UTC (45 KB)
[v4] Sun, 20 Mar 2022 00:19:52 UTC (45 KB)
[v5] Sun, 26 Jun 2022 23:17:36 UTC (46 KB)
