Skip to main content
Cornell University
We gratefully acknowledge support from the Simons Foundation, member institutions, and all contributors. Donate
arxiv logo > cs > arXiv:2110.05354

Help | Advanced Search

arXiv logo
Cornell University Logo

quick links

  • Login
  • Help Pages
  • About

Computer Science > Computation and Language

arXiv:2110.05354 (cs)
[Submitted on 6 Oct 2021 (v1), last revised 26 Jun 2022 (this version, v5)]

Title:Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

Authors:Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong
View a PDF of the paper titled Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition, by Zhong Meng and 6 other authors
View PDF
Abstract:Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing the computation cost. To overcome this, we propose an internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the token sequence probability which is approximated by the E2E model output after zeroing out the encoder contribution. During ILMA, we fine-tune the internal LM, i.e., the E2E components excluding the encoder, to minimize a cross-entropy loss. To make ILMA effective, it is essential to train the E2E model with an internal LM loss besides the standard E2E loss. Furthermore, we propose to regularize ILMA by minimizing the Kullback-Leibler divergence between the output distributions of the adapted and unadapted internal LMs. ILMA is the most effective when we update only the last linear layer of the joint network. ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost. Experimented with 30K-hour trained transformer transducer models, ILMA achieves up to 34.9% relative word error rate reduction from the unadapted baseline.
Comments: 5 pages, in Interspeech 2022
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2110.05354 [cs.CL]
  (or arXiv:2110.05354v5 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2110.05354
arXiv-issued DOI via DataCite
Journal reference: Interspeech 2022, Incheon, Korea

Submission history

From: Zhong Meng [view email]
[v1] Wed, 6 Oct 2021 23:03:29 UTC (32 KB)
[v2] Thu, 14 Oct 2021 23:14:32 UTC (32 KB)
[v3] Fri, 18 Feb 2022 19:08:15 UTC (45 KB)
[v4] Sun, 20 Mar 2022 00:19:52 UTC (45 KB)
[v5] Sun, 26 Jun 2022 23:17:36 UTC (46 KB)
Full-text links:

Access Paper:

    View a PDF of the paper titled Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition, by Zhong Meng and 6 other authors
  • View PDF
  • TeX Source
  • Other Formats
license icon view license
Current browse context:
cs.CL
< prev   |   next >
new | recent | 2021-10
Change to browse by:
cs
cs.AI
cs.LG
cs.SD
eess
eess.AS

References & Citations

  • NASA ADS
  • Google Scholar
  • Semantic Scholar

DBLP - CS Bibliography

listing | bibtex
Zhong Meng
Yashesh Gaur
Naoyuki Kanda
Jinyu Li
Xie Chen
…
a export BibTeX citation Loading...

BibTeX formatted citation

×
Data provided by:

Bookmark

BibSonomy logo Reddit logo

Bibliographic and Citation Tools

Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)

Code, Data and Media Associated with this Article

alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)

Demos

Replicate (What is Replicate?)
Hugging Face Spaces (What is Spaces?)
TXYZ.AI (What is TXYZ.AI?)

Recommenders and Search Tools

Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
  • Author
  • Venue
  • Institution
  • Topic

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
  • About
  • Help
  • contact arXivClick here to contact arXiv Contact
  • subscribe to arXiv mailingsClick here to subscribe Subscribe
  • Copyright
  • Privacy Policy
  • Web Accessibility Assistance
  • arXiv Operational Status
    Get status notifications via email or slack