Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper

Thorbecke, Iuliia; Zuluaga-Gomez, Juan; Villatoro-Tello, Esaú; Kumar, Shashi; Rangappa, Pradeep; Burdisso, Sergio; Motlicek, Petr; Pandia, Karthik; Ganapathiraju, Aravind

arXiv:2409.13499 (cs)

[Submitted on 20 Sep 2024 (v1), last revised 7 Oct 2024 (this version, v2)]

Abstract: Training automatic speech recognition (ASR) models with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch, in their entirety, on consumer and accessible GPUs using pseudo-labeled (PL) speech from foundational speech models (FSM). This allows a robust ASR model to be trained in a single stage and does not require the large data and computational budget of the two-step scenario with pre-training and fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models, such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) overall TT performance as a function of the FSM size. Our results demonstrate that TT models can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.

Comments:	Accepted to EMNLP Findings 2024
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2409.13499 [cs.CL]
	(or arXiv:2409.13499v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.13499

Submission history

From: Iuliia Thorbecke
[v1] Fri, 20 Sep 2024 13:38:59 UTC (671 KB)
[v2] Mon, 7 Oct 2024 19:16:21 UTC (674 KB)
