-
Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward
Authors:
Shashi Kumar,
Iuliia Thorbecke,
Sergio Burdisso,
Esaú Villatoro-Tello,
Manjunath K E,
Kadri Hacioğlu,
Pradeep Rangappa,
Petr Motlicek,
Aravind Ganapathiraju,
Andreas Stolcke
Abstract:
Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations on in-domain data, such as changes in speech rate or additive noise, can significantly degrade performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
Submitted 22 January, 2025; v1 submitted 6 November, 2024;
originally announced November 2024.
-
LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR
Authors:
Iuliia Thorbecke,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Andres Carofilis,
Shashi Kumar,
Petr Motlicek,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
Despite the recent success of end-to-end models for automatic speech recognition, recognizing rare and out-of-vocabulary words and adapting quickly to new domains using text remain challenging. Biasing toward such special entities often degrades overall performance. We propose a lightweight on-the-fly method to improve automatic speech recognition performance by combining a bias list of named entities with a word-level n-gram language model via shallow fusion based on the Aho-Corasick string matching algorithm. The Aho-Corasick algorithm has proved to be more efficient than other methods and allows fast context adaptation. The n-gram language model is represented as a graph with fail and output arcs, where the arc weights are adapted from the n-gram probabilities. Combining the language model with the bias entities in a single context graph lets it support keyword biasing while preserving overall performance. We demonstrate our findings on 4 languages and on 2 public datasets and 1 private dataset, including performance on named entities and out-of-vocabulary entities. We achieve up to 21.6% relative improvement in the general word error rate with no practical difference in the inverse real-time factor.
Submitted 20 September, 2024;
originally announced September 2024.
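As a rough illustration of the keyword-biasing idea in the abstract above, the following minimal sketch builds an Aho-Corasick context graph over a bias list and scores a hypothesis token by token, as a shallow-fusion bonus would be applied during decoding. It is not the authors' implementation: the class name `ContextGraph`, the constant per-token bonus, the reset-to-root behaviour after a full match, and the use of whole words as tokens are all assumptions made for illustration, and the n-gram language model weights and output arcs described in the abstract are omitted.

```python
from collections import deque


class ContextGraph:
    """Aho-Corasick trie over token sequences (hypothetical sketch)."""

    def __init__(self, bias_per_token=1.0):
        self.bias = bias_per_token  # assumed constant bonus per matched token
        # Node 0 is the root. Parallel lists hold, per node: outgoing arcs,
        # the fail arc, the bonus accumulated from the root, and an end flag.
        self.next = [{}]
        self.fail = [0]
        self.score = [0.0]
        self.is_end = [False]

    def add_phrase(self, tokens):
        node = 0
        for tok in tokens:
            if tok not in self.next[node]:
                self.next.append({})
                self.fail.append(0)
                self.score.append(self.score[node] + self.bias)
                self.is_end.append(False)
                self.next[node][tok] = len(self.next) - 1
            node = self.next[node][tok]
        self.is_end[node] = True

    def build(self):
        """Compute fail arcs breadth-first (standard Aho-Corasick)."""
        queue = deque(self.next[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for tok, child in self.next[node].items():
                f = self.fail[node]
                while f and tok not in self.next[f]:
                    f = self.fail[f]
                self.fail[child] = self.next[f].get(tok, 0)
                queue.append(child)

    def step(self, node, tok):
        """Consume one token; return (next_node, score_delta)."""
        cur = node
        while cur and tok not in self.next[cur]:
            cur = self.fail[cur]
        nxt = self.next[cur].get(tok, 0)
        # Falling back to a shallower node yields a negative delta, which
        # takes back the bonus granted to the abandoned partial match.
        delta = self.score[nxt] - self.score[node]
        if self.is_end[nxt]:
            nxt = 0  # simplification: keep the bonus and restart at the root
        return nxt, delta


def bias_score(graph, tokens):
    """Total bonus for a complete hypothesis."""
    node, total = 0, 0.0
    for tok in tokens:
        node, delta = graph.step(node, tok)
        total += delta
    # Cancel the bonus of a partial match left unfinished at the end.
    return total - graph.score[node]


graph = ContextGraph(bias_per_token=1.0)
graph.add_phrase(["new", "york"])
graph.add_phrase(["a", "b", "c"])
graph.add_phrase(["b", "d"])
graph.build()
```

With this graph, `bias_score(graph, ["new", "york", "city"])` rewards the two matched tokens, while `bias_score(graph, ["new", "jersey"])` ends at zero because the bonus for the partial match is taken back through the fail arc. In a real decoder the delta from `step` would be added to each beam hypothesis at every emitted token.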
-
Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper
Authors:
Iuliia Thorbecke,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Shashi Kumar,
Pradeep Rangappa,
Sergio Burdisso,
Petr Motlicek,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
The training of automatic speech recognition (ASR) with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained entirely from scratch on accessible consumer GPUs with pseudo-labeled (PL) speech from foundational speech models (FSM). This allows training a robust ASR model in just one stage and does not require the large data and computational budgets of the two-step scenario with pre-training and fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models, such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) TT overall performance as a function of the FSM size. Our results demonstrate that TT can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.
Submitted 7 October, 2024; v1 submitted 20 September, 2024;
originally announced September 2024.
-
TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR
Authors:
Shashi Kumar,
Srikanth Madikeri,
Juan Zuluaga-Gomez,
Iuliia Thorbecke,
Esaú Villatoro-Tello,
Sergio Burdisso,
Petr Motlicek,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
Submitted 8 October, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
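To make the idea of "integrating task-specific tokens into the reference text" from the abstract above more concrete, here is a small sketch of how an annotated conversation could be flattened into a single training transcript. The token strings (`[SCD]`, `[NE]`, `[/NE]`, `[ENDP]`), the input layout, and the choice to place an endpoint token at the end of every turn are assumptions for illustration only; the abstract does not specify them, and the linked repository is the place to check the actual format.

```python
# Hypothetical token inventory; the real names may differ.
SPEAKER_CHANGE = "[SCD]"
NE_START = "[NE]"
NE_END = "[/NE]"
ENDPOINT = "[ENDP]"


def build_reference(turns):
    """Flatten annotated turns into one token-augmented reference string.

    Each turn is a dict with 'speaker', 'words', and optionally 'entities',
    a list of (start, end) word-index spans with an exclusive end.
    """
    out = []
    prev_speaker = None
    for turn in turns:
        # Mark a speaker change between consecutive turns.
        if prev_speaker is not None and turn["speaker"] != prev_speaker:
            out.append(SPEAKER_CHANGE)
        spans = turn.get("entities", [])
        starts = {s for s, _ in spans}
        ends = {e for _, e in spans}
        for i, word in enumerate(turn["words"]):
            if i in starts:
                out.append(NE_START)
            out.append(word)
            if i + 1 in ends:
                out.append(NE_END)
        # Assumption: every turn ends at a semantic endpoint.
        out.append(ENDPOINT)
        prev_speaker = turn["speaker"]
    return " ".join(out)


example_turns = [
    {"speaker": "A", "words": ["call", "john", "smith"], "entities": [(1, 3)]},
    {"speaker": "B", "words": ["okay"]},
]
reference = build_reference(example_turns)
```

A transducer trained on such references would emit the extra tokens alongside the words, so the task outputs could be read directly from the decoded sequence instead of from separate NLP models.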
-
XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
Authors:
Shashi Kumar,
Srikanth Madikeri,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Iuliia Thorbecke,
Petr Motlicek,
Manjunath K E,
Aravind Ganapathiraju
Abstract:
Self-supervised pretrained models exhibit competitive performance in automatic speech recognition after fine-tuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as the encoder in a transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a 12% relative improvement in WER.
Submitted 8 October, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
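The attention masking mentioned in the abstract above can be pictured with a generic chunked mask. The sketch below is a common formulation and not necessarily the one used in the paper: each frame attends to its own chunk, a fixed number of preceding chunks, and optionally the first few frames of the utterance, which stand in for attention sinks. The function name `streaming_mask` and its parameters (`chunk_size`, `left_chunks`, `num_sinks`) are hypothetical.

```python
def streaming_mask(num_frames, chunk_size, left_chunks, num_sinks=0):
    """Return mask[i][j], True if query frame i may attend to key frame j."""
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        # Visible window: limited left context plus the current chunk.
        first = max(0, (chunk - left_chunks) * chunk_size)
        last = min(num_frames, (chunk + 1) * chunk_size)  # exclusive
        # Sink frames are the first num_sinks frames, never beyond the
        # current chunk, so no future information leaks in.
        sink_end = min(num_sinks, last)
        row = [(first <= j < last) or (j < sink_end) for j in range(num_frames)]
        mask.append(row)
    return mask


mask = streaming_mask(num_frames=8, chunk_size=2, left_chunks=1, num_sinks=1)
```

In this example, frame 6 sees frames 4 to 7 through its window and frame 0 through the sink, while frames 1 to 3 stay masked. Halving `left_chunks` and adding sinks is one way to trade left context for a few always-visible frames, which is the kind of configuration the abstract reports on.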