-
Improving endpoint detection in end-to-end streaming ASR for conversational speech
Authors:
Anandh C,
Karthik Pandia Durai,
Jeena Prakash,
Manickavela Arumugam,
Kadri Hacioglu,
S. Pavankumar Dubagunta,
Andreas Stolcke,
Shankar Venkatesan,
Aravind Ganapathiraju
Abstract:
ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the…
▽ More
ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the user off while speaking, returning incomplete transcript while delays in EP will increase the perceived latency, degrading the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining a reliable frame-level speech activity detection using an auxiliary network. We apply the proposed methods on Switchboard conversational speech corpus and evaluate it against a delay penalty method.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward
Authors:
Shashi Kumar,
Iuliia Thorbecke,
Sergio Burdisso,
Esaú Villatoro-Tello,
Manjunath K E,
Kadri Hacioğlu,
Pradeep Rangappa,
Petr Motlicek,
Aravind Ganapathiraju,
Andreas Stolcke
Abstract:
Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In th…
▽ More
Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations on in-domain data, such as changes in speech rate or additive noise, can significantly degrade performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
△ Less
Submitted 22 January, 2025; v1 submitted 6 November, 2024;
originally announced November 2024.
-
LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR
Authors:
Iuliia Thorbecke,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Andres Carofilis,
Shashi Kumar,
Petr Motlicek,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
Despite the recent success of end-to-end models for automatic speech recognition, recognizing special rare and out-of-vocabulary words, as well as fast domain adaptation with text, are still challenging. It often happens that biasing to the special entities leads to a degradation in the overall performance. We propose a light on-the-fly method to improve automatic speech recognition performance by…
▽ More
Despite the recent success of end-to-end models for automatic speech recognition, recognizing special rare and out-of-vocabulary words, as well as fast domain adaptation with text, are still challenging. It often happens that biasing to the special entities leads to a degradation in the overall performance. We propose a light on-the-fly method to improve automatic speech recognition performance by combining a bias list of named entities with a word-level n-gram language model with the shallow fusion approach based on the Aho-Corasick string matching algorithm. The Aho-Corasick algorithm has proved to be more efficient than other methods and allows fast context adaptation. An n-gram language model is introduced as a graph with fail and output arcs, where the arc weights are adapted from the n-gram probabilities. The language model is used as an additional support to keyword biasing when the language model is combined with bias entities in a single context graph to take care of the overall performance. We demonstrate our findings on 4 languages, 2 public and 1 private datasets including performance on named entities and out-of-vocabulary entities. We achieve up to 21.6% relative improvement in the general word error rate with no practical difference in the inverse real-time factor.
△ Less
Submitted 20 September, 2024;
originally announced September 2024.
-
Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper
Authors:
Iuliia Thorbecke,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Shashi Kumar,
Pradeep Rangappa,
Sergio Burdisso,
Petr Motlicek,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
The training of automatic speech recognition (ASR) with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch in consumer and accessible GPUs in their entirety with pseudo-labeled (PL) speech from foundational speech models (FSM). This allows training a robust ASR model just in one stage and…
▽ More
The training of automatic speech recognition (ASR) with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch in consumer and accessible GPUs in their entirety with pseudo-labeled (PL) speech from foundational speech models (FSM). This allows training a robust ASR model just in one stage and does not require large data and computational budget compared to the two-step scenario with pre-training and fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) TT overall performance as the function of the FSM size. Our results demonstrate that TT can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.
△ Less
Submitted 7 October, 2024; v1 submitted 20 September, 2024;
originally announced September 2024.
-
TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR
Authors:
Shashi Kumar,
Srikanth Madikeri,
Juan Zuluaga-Gomez,
Iuliia Thorbecke,
Esaú Villatoro-Tello,
Sergio Burdisso,
Petr Motlicek,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achie…
▽ More
In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
△ Less
Submitted 8 October, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
Authors:
Shashi Kumar,
Srikanth Madikeri,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Iuliia Thorbecke,
Petr Motlicek,
Manjunath K E,
Aravind Ganapathiraju
Abstract:
Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our exper…
▽ More
Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.
△ Less
Submitted 8 October, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
Implementing contextual biasing in GPU decoder for online ASR
Authors:
Iuliia Nigmatulina,
Srikanth Madikeri,
Esaú Villatoro-Tello,
Petr Motliček,
Juan Zuluaga-Gomez,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
GPU decoding significantly accelerates the output of ASR predictions. While GPUs are already being used for online ASR decoding, post-processing and rescoring on GPUs have not been properly investigated yet. Rescoring with available contextual information can considerably improve ASR predictions. Previous studies have proven the viability of lattice rescoring in decoding and biasing language model…
▽ More
GPU decoding significantly accelerates the output of ASR predictions. While GPUs are already being used for online ASR decoding, post-processing and rescoring on GPUs have not been properly investigated yet. Rescoring with available contextual information can considerably improve ASR predictions. Previous studies have proven the viability of lattice rescoring in decoding and biasing language model (LM) weights in offline and online CPU scenarios. In real-time GPU decoding, partial recognition hypotheses are produced without lattice generation, which makes the implementation of biasing more complex. The paper proposes and describes an approach to integrate contextual biasing in real-time GPU decoding while exploiting the standard Kaldi GPU decoder. Besides the biasing of partial ASR predictions, our approach also permits dynamic context switching allowing a flexible rescoring per each speech segment directly on GPU. The code is publicly released and tested with open-sourced test sets.
△ Less
Submitted 23 June, 2023;
originally announced June 2023.
-
On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion and Automatic Speech Recognition
Authors:
Lokesh Bansal,
S. Pavankumar Dubagunta,
Malolan Chetlur,
Pushpak Jagtap,
Aravind Ganapathiraju
Abstract:
New-age conversational agent systems perform both speech emotion recognition (SER) and automatic speech recognition (ASR) using two separate and often independent approaches for real-world application in noisy environments. In this paper, we investigate a joint ASR-SER multitask learning approach in a low-resource setting and show that improvements are observed not only in SER, but also in ASR. We…
▽ More
New-age conversational agent systems perform both speech emotion recognition (SER) and automatic speech recognition (ASR) using two separate and often independent approaches for real-world application in noisy environments. In this paper, we investigate a joint ASR-SER multitask learning approach in a low-resource setting and show that improvements are observed not only in SER, but also in ASR. We also investigate the robustness of such jointly trained models to the presence of background noise, babble, and music. Experimental results on the IEMOCAP dataset show that joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3% respectively in clean scenarios. In noisy scenarios, results on data augmented with MUSAN show that the joint approach outperforms the independent ASR and SER approaches across many noisy conditions. Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches.
△ Less
Submitted 25 May, 2023; v1 submitted 21 May, 2023;
originally announced May 2023.
-
Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks
Authors:
Esaú Villatoro-Tello,
Srikanth Madikeri,
Juan Zuluaga-Gomez,
Bidisha Sharma,
Seyyed Saeed Sarfjoo,
Iuliia Nigmatulina,
Petr Motlicek,
Alexei V. Ivanov,
Aravind Ganapathiraju
Abstract:
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of what could be the achievable perfo…
▽ More
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of what could be the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word-consensus-networks, allows the SLU system to improve in comparison to the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtains performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, being a recommended alternative to overcome the limitations of working with automatically generated transcripts.
△ Less
Submitted 17 March, 2023; v1 submitted 16 December, 2022;
originally announced December 2022.
-
Neural Network Based Speaker Classification and Verification Systems with Enhanced Features
Authors:
Zhenhao Ge,
Ananth N. Iyer,
Srinath Cheluvaraja,
Ram Sundaram,
Aravind Ganapathiraju
Abstract:
This work presents a novel framework based on feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves 100% classification rate in classification and less than 6% Equal Error Rate (ERR), using merely about 1 second and 5 seconds of data respectively. Features with st…
▽ More
This work presents a novel framework based on feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves 100% classification rate in classification and less than 6% Equal Error Rate (ERR), using merely about 1 second and 5 seconds of data respectively. Features with stricter Voice Active Detection (VAD) than the regular one for speech recognition ensure extracting stronger voiced portion for speaker recognition, speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker. Both are proven to improve the system performance. In building the neural network speaker classifier, the network structure parameters are optimized with grid search and dynamically reduced regularization parameters are used to avoid training terminated in local minimum. It enables the training goes further with lower cost. In speaker verification, performance is improved with prediction score normalization, which rewards the speaker identity indices with distinct peaks and penalizes the weak ones with high scores but more competitors, and speaker-specific thresholding, which significantly reduces ERR in the ROC curve. TIMIT corpus with 8K sampling rate is used here. First 200 male speakers are used to train and test the classification performance. The testing files of them are used as in-domain registered speakers, while data from the remaining 126 male speakers are used as out-of-domain speakers, i.e. imposters in speaker verification.
△ Less
Submitted 7 February, 2017;
originally announced February 2017.
-
Speaker Change Detection Using Features through A Neural Network Speaker Classifier
Authors:
Zhenhao Ge,
Ananth N. Iyer,
Srinath Cheluvaraja,
Aravind Ganapathiraju
Abstract:
The mechanism proposed here is for real-time speaker change detection in conversations, which firstly trains a neural network text-independent speaker classifier using in-domain speaker data. Through the network, features of conversational speech from out-of-domain speakers are then converted into likelihood vectors, i.e. similarity scores comparing to the in-domain speakers. These transformed fea…
▽ More
The mechanism proposed here is for real-time speaker change detection in conversations, which firstly trains a neural network text-independent speaker classifier using in-domain speaker data. Through the network, features of conversational speech from out-of-domain speakers are then converted into likelihood vectors, i.e. similarity scores comparing to the in-domain speakers. These transformed features demonstrate very distinctive patterns, which facilitates differentiating speakers and enable speaker change detection with some straight-forward distance metrics. The speaker classifier and the speaker change detector are trained/tested using speech of the first 200 (in-domain) and the remaining 126 (out-of-domain) male speakers in TIMIT respectively. For the speaker classification, 100% accuracy at a 200 speaker size is achieved on any testing file, given the speech duration is at least 0.97 seconds. For the speaker change detection using speaker classification outputs, performance based on 0.5, 1, and 2 seconds of inspection intervals were evaluated in terms of error rate and F1 score, using synthesized data by concatenating speech from various speakers. It captures close to 97% of the changes by comparing the current second of speech with the previous second, which is very competitive among literature using other methods.
△ Less
Submitted 7 February, 2017;
originally announced February 2017.
-
Generation and Pruning of Pronunciation Variants to Improve ASR Accuracy
Authors:
Zhenhao Ge,
Aravind Ganapathiraju,
Ananth N. Iyer,
Scott A. Randal,
Felix I. Wyss
Abstract:
Speech recognition, especially name recognition, is widely used in phone services such as company directory dialers, stock quote providers or location finders. It is usually challenging due to pronunciation variations. This paper proposes an efficient and robust data-driven technique which automatically learns acceptable word pronunciations and updates the pronunciation dictionary to build a bette…
▽ More
Speech recognition, especially name recognition, is widely used in phone services such as company directory dialers, stock quote providers or location finders. It is usually challenging due to pronunciation variations. This paper proposes an efficient and robust data-driven technique which automatically learns acceptable word pronunciations and updates the pronunciation dictionary to build a better lexicon without affecting recognition of other words similar to the target word. It generalizes well on datasets with various sizes, and reduces the error rate on a database with 13000+ human names by 42%, compared to a baseline with regular dictionaries already covering canonical pronunciations of 97%+ words in names, plus a well-trained spelling-to-pronunciation (STP) engine.
△ Less
Submitted 28 June, 2016;
originally announced June 2016.
-
Accent Classification with Phonetic Vowel Representation
Authors:
Zhenhao Ge,
Yingyi Tan,
Aravind Ganapathiraju
Abstract:
Previous accent classification research focused mainly on detecting accents with pure acoustic information without recognizing accented speech. This work combines phonetic knowledge such as vowels with acoustic information to build Guassian Mixture Model (GMM) classifier with Perceptual Linear Predictive (PLP) features, optimized by Hetroscedastic Linear Discriminant Analysis (HLDA). With input ab…
▽ More
Previous accent classification research focused mainly on detecting accents with pure acoustic information without recognizing accented speech. This work combines phonetic knowledge such as vowels with acoustic information to build Guassian Mixture Model (GMM) classifier with Perceptual Linear Predictive (PLP) features, optimized by Hetroscedastic Linear Discriminant Analysis (HLDA). With input about 20-second accented speech, this system achieves classification rate of 51% on a 7-way classification system focusing on the major types of accents in English, which is competitive to the state-of-the-art results in this field.
△ Less
Submitted 23 February, 2016;
originally announced April 2016.