Showing 1–2 of 2 results for author: Jogi, Y

Search v0.5.6 released 2020-02-24

arXiv:2502.13446 [pdf, other]

eess.AS cs.LG

Adopting Whisper for Confidence Estimation

Authors: Vaibhav Aggarwal, Shabari S Nair, Yash Verma, Yash Jogi

Abstract: Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidenc… ▽ More Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM baseline by a substantial margin across all datasets. △ Less

Submitted 19 February, 2025; originally announced February 2025.

Comments: Accepted at IEEE ICASSP 2025
arXiv:2502.11572 [pdf, other]

eess.AS cs.SD

Improving Rare-Word Recognition of Whisper in Zero-Shot Settings

Authors: Yash Jogi, Vaibhav Aggarwal, Shabari S Nair, Yash Verma, Aayush Kubba

Abstract: Whisper, despite being trained on 680K hours of web-scaled audio data, faces difficulty in recognising rare words like domain-specific terms, with a solution being contextual biasing through prompting. To improve upon this method, in this paper, we propose a supervised learning strategy to fine-tune Whisper for contextual biasing instruction. We demonstrate that by using only 670 hours of Common V… ▽ More Whisper, despite being trained on 680K hours of web-scaled audio data, faces difficulty in recognising rare words like domain-specific terms, with a solution being contextual biasing through prompting. To improve upon this method, in this paper, we propose a supervised learning strategy to fine-tune Whisper for contextual biasing instruction. We demonstrate that by using only 670 hours of Common Voice English set for fine-tuning, our model generalises to 11 diverse open-source English datasets, achieving a 45.6% improvement in recognition of rare words and 60.8% improvement in recognition of words unseen during fine-tuning over the baseline method. Surprisingly, our model's contextual biasing ability generalises even to languages unseen during fine-tuning. △ Less

Submitted 18 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

Comments: Accepted at IEEE SLT 2024

Search v0.5.6 released 2020-02-24