-
Unifying Streaming and Non-streaming Zipformer-based ASR
Authors:
Bidisha Sharma,
Karthik Pandia Durai,
Shankar Venkatesan,
Jeena J Prakash,
Shashi Kumar,
Malolan Chetlur,
Andreas Stolcke
Abstract:
There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through chunked attention masking in the training of zipformer-based ASR models. We demonstrate that right-context is more effective in zipformer models than in other conformer models, owing to the zipformer's multi-scale nature. We analyze the effect of varying the number of right-context frames on the accuracy and latency of the streaming ASR models. We use Librispeech and large in-house conversational datasets to train different versions of streaming and non-streaming models and evaluate them in a production-grade server-client setup across diverse test sets from different domains. The proposed strategy reduces word error rate by a relative 7.9% with a small degradation in user-perceived latency. By adding more right-context frames, we are able to achieve streaming performance close to that of non-streaming models. Our approach also allows flexible control of the latency-accuracy tradeoff according to customers' requirements.
Submitted 17 June, 2025;
originally announced June 2025.
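The chunked attention masking with right-context mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `chunked_attention_mask`, the assumption of unlimited left context, and the parameter names are all hypothetical and chosen only for illustration.

```python
import numpy as np

def chunked_attention_mask(num_frames, chunk_size, right_context):
    """Boolean mask where mask[i, j] is True if query frame i may attend to key frame j.

    Assumptions (not from the paper): full left context, fixed-size chunks,
    and right_context counted in frames beyond the end of the query's chunk.
    """
    idx = np.arange(num_frames)
    # Exclusive end index of the chunk that each query frame belongs to.
    chunk_end = (idx // chunk_size + 1) * chunk_size
    # Allow a look-ahead of right_context frames, clipped to the sequence length.
    limit = np.minimum(chunk_end + right_context, num_frames)
    return idx[None, :] < limit[:, None]

mask = chunked_attention_mask(num_frames=6, chunk_size=2, right_context=1)
print(mask.astype(int))
```

A "dynamic" variant could draw `right_context` from a small set of values for each training batch, which is one plausible reading of the abstract rather than a confirmed detail.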
-
Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM
Authors:
Jeena Prakash,
Blessingh Kumar,
Kadri Hacioglu,
Bidisha Sharma,
Sindhuja Gopalan,
Malolan Chetlur,
Shankar Venkatesan,
Andreas Stolcke
Abstract:
Automatic speech recognition (ASR) models rely on high-quality transcribed data for effective training. Generating pseudo-labels for large unlabeled audio datasets often relies on complex pipelines that combine multiple ASR outputs through multi-stage processing, leading to error propagation, information loss and disjoint optimization. We propose a unified multi-ASR prompt-driven framework using postprocessing by either textual or speech-based large language models (LLMs), replacing voting or other arbitration logic for reconciling the ensemble outputs. We perform a comparative study of multiple architectures with and without LLMs, showing significant improvements in transcription accuracy compared to traditional methods. Furthermore, we use the pseudo-labels generated by the various approaches to train semi-supervised ASR models for different datasets, again showing improved performance with textual and speechLLM transcriptions compared to baselines.
Submitted 5 June, 2025;
originally announced June 2025.
-
Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering
Authors:
Andres Carofilis,
Pradeep Rangappa,
Srikanth Madikeri,
Shashi Kumar,
Sergio Burdisso,
Jeena Prakash,
Esau Villatoro-Tello,
Petr Motlicek,
Bidisha Sharma,
Kadri Hacioglu,
Shankar Venkatesan,
Saurabh Vyas,
Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce, but unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, the pipeline outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
Submitted 5 June, 2025;
originally announced June 2025.
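As a rough illustration of the multi-model consensus filtering described in the abstract above, the following sketch keeps an utterance only when the hypotheses from several ASR models agree closely. The agreement measure (pairwise normalized word edit distance), the threshold `max_disagreement`, and the choice to keep the first model's hypothesis as the pseudo-label are assumptions made for this example, not details taken from the paper.

```python
from itertools import combinations

def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (wa != wb)))  # substitution or match
        prev = curr
    return prev[-1]

def disagreement(a, b):
    """Edit distance normalized by the longer hypothesis (0.0 means identical)."""
    return word_edit_distance(a, b) / max(len(a), len(b), 1)

def consensus_filter(hyps_by_utt, max_disagreement=0.1):
    """Keep utterances whose hypotheses all agree within the (hypothetical) threshold."""
    kept = {}
    for utt_id, hyps in hyps_by_utt.items():
        tokens = [h.split() for h in hyps]
        worst = max((disagreement(a, b) for a, b in combinations(tokens, 2)),
                    default=0.0)
        if worst <= max_disagreement:
            kept[utt_id] = hyps[0]  # assumption: first model's output is the label
    return kept

example = {
    "utt1": ["hello world", "hello world"],
    "utt2": ["call me back", "tall me black"],
}
print(consensus_filter(example))
```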
-
Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering
Authors:
Pradeep Rangappa,
Andres Carofilis,
Jeena Prakash,
Shashi Kumar,
Sergio Burdisso,
Srikanth Madikeri,
Esau Villatoro-Tello,
Bidisha Sharma,
Petr Motlicek,
Kadri Hacioglu,
Shankar Venkatesan,
Saurabh Vyas,
Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
Submitted 4 June, 2025;
originally announced June 2025.
-
Improving endpoint detection in end-to-end streaming ASR for conversational speech
Authors:
Anandh C,
Karthik Pandia Durai,
Jeena Prakash,
Manickavela Arumugam,
Kadri Hacioglu,
S. Pavankumar Dubagunta,
Andreas Stolcke,
Shankar Venkatesan,
Aravind Ganapathiraju
Abstract:
ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the user off while speaking, returning an incomplete transcript, while delays in EP will increase the perceived latency, degrading the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining reliable frame-level speech activity detection using an auxiliary network. We apply the proposed methods to the Switchboard conversational speech corpus and evaluate them against a delay penalty method.
Submitted 19 May, 2025;
originally announced May 2025.
-
SRPT vs Smith Predictor for Vehicle Teleoperation
Authors:
Jai Prakash,
Michele Vignati,
Edoardo Sabbioni
Abstract:
Vehicle teleoperation has potential applications in fallback solutions for autonomous vehicles, remote delivery services, and hazardous operations. However, network delays and limited situational awareness can compromise teleoperation performance and increase the cognitive workload of human operators. To address these issues, we previously introduced the novel successive reference pose tracking (SRPT) approach, which transmits successive reference poses to the vehicle instead of steering commands. This paper compares the stability and performance of SRPT with Smith predictor-based approaches for direct vehicle teleoperation in challenging scenarios. The Smith predictor approach is further divided into two variants, one with a Lookahead driver and one with a Stanley driver. Simulations are conducted in a Simulink environment, considering variable network delays and different vehicle speeds, and include maneuvers such as tight corners, slalom, low-adhesion roads, and strong crosswinds. The results show that the SRPT approach significantly improves stability and reference tracking performance, with negligible effect of network delays on path tracking. Our findings demonstrate the effectiveness of SRPT in eliminating the detrimental effect of network delays in vehicle teleoperation.
Submitted 27 April, 2023;
originally announced May 2023.
-
Maximum entropy based non-negative optoacoustic tomographic image reconstruction
Authors:
Jaya Prakash,
Subhamoy Mandal,
Daniel Razansky,
Vasilis Ntziachristos
Abstract:
Objective: Optoacoustic (photoacoustic) tomography is aimed at reconstructing maps of the initial pressure rise induced by the absorption of light pulses in tissue. In practice, due to inaccurate assumptions in the forward model, noise and other experimental factors, the images are often afflicted by artifacts, occasionally manifested as negative values. The aim of the work is to develop an inversion method which reduces the occurrence of negative values and improves the quantitative performance of optoacoustic imaging. Methods: We present a novel method for optoacoustic tomography based on an entropy maximization algorithm, which uses logarithmic regularization for attaining non-negative reconstructions. The reconstruction image quality is further improved using structural prior based fluence correction. Results: We report the performance achieved by the entropy maximization scheme on numerical simulation, experimental phantoms and in-vivo samples. Conclusion: The proposed algorithm demonstrates superior reconstruction performance by delivering non-negative pixel values with no visible distortion of anatomical structures. Significance: Our method can enable quantitative optoacoustic imaging, and has the potential to improve pre-clinical and translational imaging applications.
Submitted 11 January, 2019; v1 submitted 26 July, 2017;
originally announced July 2017.