Search | arXiv e-print repository

Unifying Streaming and Non-streaming Zipformer-based ASR

Authors: Bidisha Sharma, Karthik Pandia Durai, Shankar Venkatesan, Jeena J Prakash, Shashi Kumar, Malolan Chetlur, Andreas Stolcke

Abstract: There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through chunked attention masking in the training of zipformer-based ASR models. We demonstrate that using right-context is more effective in zipformer models compared to other conformer models due to its multi-scale nature. We analyze the effect of varying the number of right-context frames on accuracy and latency of the streaming ASR models. We use Librispeech and large in-house conversational datasets to train different versions of streaming and non-streaming models and evaluate them in a production-grade server-client setup across diverse test sets from different domains. The proposed strategy reduces word error by 7.9% relative with a small degradation in user-perceived latency. By adding more right-context frames, we are able to achieve streaming performance close to that of non-streaming models. Our approach also allows flexible control of the latency-accuracy tradeoff according to customer requirements.

Submitted 17 June, 2025; originally announced June 2025.

Comments: Accepted in ACL2025 Industry track

arXiv:2506.11089

Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM

Authors: Jeena Prakash, Blessingh Kumar, Kadri Hacioglu, Bidisha Sharma, Sindhuja Gopalan, Malolan Chetlur, Shankar Venkatesan, Andreas Stolcke

Abstract: Automatic speech recognition (ASR) models rely on high-quality transcribed data for effective training. Generating pseudo-labels for large unlabeled audio datasets often relies on complex pipelines that combine multiple ASR outputs through multi-stage processing, leading to error propagation, information loss and disjoint optimization. We propose a unified multi-ASR prompt-driven framework using postprocessing by either textual or speech-based large language models (LLMs), replacing voting or other arbitration logic for reconciling the ensemble outputs. We perform a comparative study of multiple architectures with and without LLMs, showing significant improvements in transcription accuracy compared to traditional methods. Furthermore, we use the pseudo-labels generated by the various approaches to train semi-supervised ASR models for different datasets, again showing improved performance with textual and speechLLM transcriptions compared to baselines.

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.04981

Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

Authors: Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

Abstract: Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.

Submitted 5 June, 2025; originally announced June 2025.

Comments: Accepted at Interspeech 2025, Netherlands

arXiv:2506.03681

Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

Authors: Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

Abstract: Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.

Submitted 4 June, 2025; originally announced June 2025.

Comments: Accepted at Interspeech 2025, Netherlands

arXiv:2505.17070

Improving endpoint detection in end-to-end streaming ASR for conversational speech

Authors: Anandh C, Karthik Pandia Durai, Jeena Prakash, Manickavela Arumugam, Kadri Hacioglu, S. Pavankumar Dubagunta, Andreas Stolcke, Shankar Venkatesan, Aravind Ganapathiraju

Abstract: ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the user off while speaking, returning an incomplete transcript, while delays in EP will increase the perceived latency, degrading the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining a reliable frame-level speech activity detection using an auxiliary network. We apply the proposed methods to the Switchboard conversational speech corpus and evaluate them against a delay penalty method.

Submitted 19 May, 2025; originally announced May 2025.

Comments: Submitted to Interspeech 2024

arXiv:2305.00911

SRPT vs Smith Predictor for Vehicle Teleoperation

Authors: Jai Prakash, Michele Vignati, Edoardo Sabbioni

Abstract: Vehicle teleoperation has potential applications in fallback solutions for autonomous vehicles, remote delivery services, and hazardous operations. However, network delays and limited situational awareness can compromise teleoperation performance and increase the cognitive workload of human operators. To address these issues, we previously introduced the novel successive reference pose tracking (SRPT) approach, which transmits successive reference poses to the vehicle instead of steering commands. This paper compares the stability and performance of SRPT with Smith predictor-based approaches for direct vehicle teleoperation in challenging scenarios. The Smith predictor approach is further divided into two variants, one with a Lookahead driver and the other with a Stanley driver. Simulations are conducted in a Simulink environment, considering variable network delays and different vehicle speeds, and include maneuvers such as tight corners, slalom, low-adhesion roads, and strong crosswinds. The results show that the SRPT approach significantly improves stability and reference tracking performance, with negligible effect of network delays on path tracking. Our findings demonstrate the effectiveness of SRPT in eliminating the detrimental effect of network delays in vehicle teleoperation.

Submitted 27 April, 2023; originally announced May 2023.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:1707.08391

doi: 10.1109/TBME.2019.2892842

Maximum entropy based non-negative optoacoustic tomographic image reconstruction

Authors: Jaya Prakash, Subhamoy Mandal, Daniel Razansky, Vasilis Ntziachristos

Abstract: Objective: Optoacoustic (photoacoustic) tomography is aimed at reconstructing maps of the initial pressure rise induced by the absorption of light pulses in tissue. In practice, due to inaccurate assumptions in the forward model, noise and other experimental factors, the images are often afflicted by artifacts, occasionally manifested as negative values. The aim of the work is to develop an inversion method which reduces the occurrence of negative values and improves the quantitative performance of optoacoustic imaging. Methods: We present a novel method for optoacoustic tomography based on an entropy maximization algorithm, which uses logarithmic regularization for attaining non-negative reconstructions. The reconstruction image quality is further improved using structural prior based fluence correction. Results: We report the performance achieved by the entropy maximization scheme on numerical simulation, experimental phantoms and in-vivo samples. Conclusion: The proposed algorithm demonstrates superior reconstruction performance by delivering non-negative pixel values with no visible distortion of anatomical structures. Significance: Our method can enable quantitative optoacoustic imaging, and has the potential to improve pre-clinical and translational imaging applications.

Submitted 11 January, 2019; v1 submitted 26 July, 2017; originally announced July 2017.

Comments: This article has been accepted for publication in IEEE Transactions on Biomedical Engineering (30 Dec 2018)

Showing 1–7 of 7 results for author: Prakash, J