Showing 1–16 of 16 results for author: Madikeri, S

Searching in archive eess.
  1. arXiv:2506.04981  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

    Authors: Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

    Abstract: Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxilia…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Accepted at Interspeech 2025, Netherlands

  2. arXiv:2506.03681  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

    Authors: Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

    Abstract: Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple sel…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted at Interspeech 2025, Netherlands

  3. arXiv:2407.04444  [pdf, other]

    cs.CL cs.SD eess.AS

    TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR

    Authors: Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Iuliia Thorbecke, Esaú Villatoro-Tello, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju

    Abstract: In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achie…

    Submitted 8 October, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: Accepted at EMNLP 2024 (Main Conference)

  4. arXiv:2407.04439  [pdf, other]

    eess.AS

    XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

    Authors: Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Iuliia Thorbecke, Petr Motlicek, Manjunath K E, Aravind Ganapathiraju

    Abstract: Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our exper…

    Submitted 8 October, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: 5 pages, double column

  5. arXiv:2306.15685  [pdf, other]

    eess.AS cs.CL

    Implementing contextual biasing in GPU decoder for online ASR

    Authors: Iuliia Nigmatulina, Srikanth Madikeri, Esaú Villatoro-Tello, Petr Motliček, Juan Zuluaga-Gomez, Karthik Pandia, Aravind Ganapathiraju

    Abstract: GPU decoding significantly accelerates the output of ASR predictions. While GPUs are already being used for online ASR decoding, post-processing and rescoring on GPUs have not been properly investigated yet. Rescoring with available contextual information can considerably improve ASR predictions. Previous studies have proven the viability of lattice rescoring in decoding and biasing language model…

    Submitted 23 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023

  6. arXiv:2305.01155  [pdf, other]

    eess.AS cs.CL cs.HC cs.SD

    Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding

    Authors: Juan Zuluaga-Gomez, Iuliia Nigmatulina, Amrutha Prasad, Petr Motlicek, Driss Khalil, Srikanth Madikeri, Allan Tart, Igor Szoke, Vincent Lenders, Mickael Rigault, Khalid Choukri

    Abstract: Voice communication between air traffic controllers (ATCos) and pilots is critical for ensuring safe and efficient air traffic control (ATC). This task requires high levels of awareness from ATCos and can be tedious and error-prone. Recent attempts have been made to integrate artificial intelligence (AI) into ATC in order to reduce the workload of ATCos. However, the development of data-driven AI…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: Manuscript under review

  7. arXiv:2212.08489  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

    Authors: Esaú Villatoro-Tello, Srikanth Madikeri, Juan Zuluaga-Gomez, Bidisha Sharma, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Petr Motlicek, Alexei V. Ivanov, Aravind Ganapathiraju

    Abstract: In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of what could be the achievable perfo…

    Submitted 17 March, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

    Comments: Accepted in ICASSP 2023

    ACM Class: I.2.7

    Journal ref: ICASSP 2023

  8. A Comparison of Methods for OOV-word Recognition on a New Public Dataset

    Authors: Rudolf A. Braun, Srikanth Madikeri, Petr Motlicek

    Abstract: A common problem for automatic speech recognition systems is how to recognize words that they did not see during training. Currently there is no established method of evaluating different techniques for tackling this problem. We propose using the CommonVoice dataset to create test sets for multiple languages which have a high out-of-vocabulary (OOV) ratio relative to a training set and release a n…

    Submitted 16 July, 2021; originally announced July 2021.

  9. arXiv:2104.02558  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Comparing CTC and LFMMI for out-of-domain adaptation of wav2vec 2.0 acoustic model

    Authors: Apoorv Vyas, Srikanth Madikeri, Hervé Bourlard

    Abstract: In this work, we investigate if the wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues with connectionist temporal classification (CTC) training to reduce its performance gap with flat-start lattice-free MMI (E2E-LFMMI) for automatic speech recognition with limited training data. Towards that objective, we use the pretrained wav2vec 2.0 BASE model and fine-tune it on thr…

    Submitted 6 April, 2021; originally announced April 2021.

  10. arXiv:2012.14252  [pdf, ps, other]

    cs.LG cs.SD eess.AS

    Lattice-Free MMI Adaptation Of Self-Supervised Pretrained Acoustic Models

    Authors: Apoorv Vyas, Srikanth Madikeri, Hervé Bourlard

    Abstract: In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of self-supervised pretrained acoustic model. We pretrain a Transformer model on thousand hours of untranscribed Librispeech data followed by supervised adaptation with LFMMI on three different datasets. Our results show that fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the c…

    Submitted 6 April, 2021; v1 submitted 28 December, 2020; originally announced December 2020.

  11. arXiv:2010.12277  [pdf, other]

    cs.SD eess.AS

    Speech Activity Detection Based on Multilingual Speech Recognition System

    Authors: Seyyed Saeed Sarfjoo, Srikanth Madikeri, Petr Motlicek

    Abstract: To better model the contextual information and increase the generalization ability of Speech Activity Detection (SAD) system, this paper leverages a multi-lingual Automatic Speech Recognition (ASR) system to perform SAD. Sequence discriminative training of Acoustic Model (AM) using Lattice-Free Maximum Mutual Information (LF-MMI) loss function, effectively extracts the contextual information of th…

    Submitted 11 April, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  12. Novel Architectures for Unsupervised Information Bottleneck based Speaker Diarization of Meetings

    Authors: Nauman Dawalatabad, Srikanth Madikeri, C. Chandra Sekhar, Hema A. Murthy

    Abstract: Speaker diarization is an important problem that is topical, and is especially useful as a preprocessor for conversational speech related applications. The objective of this paper is two-fold: (i) segment initialization by uniformly distributing speaker information across the initial segments, and (ii) incorporating speaker discriminative features within the unsupervised diarization framework. In…

    Submitted 13 October, 2020; originally announced October 2020.

    Comments: Accepted in IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, 2021, pp 14-27

  13. arXiv:2010.03466  [pdf, ps, other]

    eess.AS cs.SD

    Pkwrap: a PyTorch Package for LF-MMI Training of Acoustic Models

    Authors: Srikanth Madikeri, Sibo Tong, Juan Zuluaga-Gomez, Apoorv Vyas, Petr Motlicek, Hervé Bourlard

    Abstract: We present a simple wrapper that is useful to train acoustic models in PyTorch using Kaldi's LF-MMI training framework. The wrapper, called pkwrap (short form of PyTorch kaldi wrapper), enables the user to utilize the flexibility provided by PyTorch in designing model architectures. It exposes the LF-MMI cost function as an autograd function. Other capabilities of Kaldi have also been ported to Py…

    Submitted 7 October, 2020; originally announced October 2020.

  14. arXiv:2006.09054  [pdf, other]

    eess.AS cs.SD

    Quantization of Acoustic Model Parameters in Automatic Speech Recognition Framework

    Authors: Amrutha Prasad, Petr Motlicek, Srikanth Madikeri

    Abstract: State-of-the-art hybrid automatic speech recognition (ASR) system exploits deep neural network (DNN) based acoustic models (AM) trained with Lattice Free-Maximum Mutual Information (LF-MMI) criterion and n-gram language models. The AMs typically have millions of parameters and require significant parameter reduction to operate on embedded devices. The impact of parameter quantization on the overal…

    Submitted 20 November, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: Submitted to ICASSP21

  15. arXiv:2006.02093  [pdf, other]

    cs.SI cs.SD eess.AS

    Graph2Speak: Improving Speaker Identification using Network Knowledge in Criminal Conversational Data

    Authors: Mael Fabien, Seyyed Saeed Sarfjoo, Petr Motlicek, Srikanth Madikeri

    Abstract: Criminal investigations mostly rely on the collection of speech conversational data in order to identify speakers and build or enrich an existing criminal network. Social network analysis tools are then applied to identify the most central characters and the different communities within the network. We introduce two candidate datasets for criminal conversational data, Crime Scene Investigation (CS…

    Submitted 21 September, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

  16. Incremental Transfer Learning in Two-pass Information Bottleneck based Speaker Diarization System for Meetings

    Authors: Nauman Dawalatabad, Srikanth Madikeri, C Chandra Sekhar, Hema A Murthy

    Abstract: The two-pass information bottleneck (TPIB) based speaker diarization system operates independently on different conversational recordings. TPIB system does not consider previously learned speaker discriminative information while diarizing new conversations. Hence, the real time factor (RTF) of TPIB system is high owing to the training time required for the artificial neural network (ANN). This pap…

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: 5 pages, 2 figures, To appear in Proc. ICASSP 2019, May 12-17, 2019, Brighton, UK