-
Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering
Authors:
Andres Carofilis,
Pradeep Rangappa,
Srikanth Madikeri,
Shashi Kumar,
Sergio Burdisso,
Jeena Prakash,
Esau Villatoro-Tello,
Petr Motlicek,
Bidisha Sharma,
Kadri Hacioglu,
Shankar Venkatesan,
Saurabh Vyas,
Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
Submitted 5 June, 2025;
originally announced June 2025.
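The multi-model consensus filtering mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that consensus is measured as the mean pairwise word-level edit distance between the hypotheses of several ASR models, and that an utterance is kept when this disagreement falls below a threshold. The function names, the normalisation, and the threshold value are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def disagreement(hypotheses):
    """Mean pairwise word-level edit distance, normalised by the longer hypothesis.

    Needs at least two hypotheses (one per model).
    """
    rates = []
    for a, b in combinations(hypotheses, 2):
        words_a, words_b = a.split(), b.split()
        denom = max(len(words_a), len(words_b), 1)
        rates.append(edit_distance(words_a, words_b) / denom)
    return sum(rates) / len(rates)


def consensus_filter(utterances, threshold=0.1):
    """Keep utterance IDs whose model hypotheses agree closely.

    `utterances` maps an utterance ID to a list of transcripts, one per model.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [utt for utt, hyps in utterances.items() if disagreement(hyps) <= threshold]


pseudo_labels = {
    "utt1": ["please hold the line", "please hold the line", "please hold the line"],
    "utt2": ["your account number is", "you count number his", "your account numbers"],
}
selected = consensus_filter(pseudo_labels)
```

In this toy input only `utt1` survives, because all three hypothetical models produce the same transcript for it.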
-
Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering
Authors:
Pradeep Rangappa,
Andres Carofilis,
Jeena Prakash,
Shashi Kumar,
Sergio Burdisso,
Srikanth Madikeri,
Esau Villatoro-Tello,
Bidisha Sharma,
Petr Motlicek,
Kadri Hacioglu,
Shankar Venkatesan,
Saurabh Vyas,
Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
Submitted 4 June, 2025;
originally announced June 2025.
-
TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR
Authors:
Shashi Kumar,
Srikanth Madikeri,
Juan Zuluaga-Gomez,
Iuliia Thorbecke,
Esaú Villatoro-Tello,
Sergio Burdisso,
Petr Motlicek,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
Submitted 8 October, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
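The idea of integrating task-specific tokens into the reference text, as described in the abstract above, can be sketched as follows. The token spellings (`[SCD]`, `[NE]`, `[/NE]`, `[ENDP]`), the annotation format, and the function name are assumptions made for illustration; the released code should be consulted for the actual conventions.

```python
def add_task_tokens(words, speaker_change_at=(), entity_spans=(), endpoints_after=()):
    """Return a reference transcript with task tokens interleaved.

    words: list of word strings.
    speaker_change_at: word indices at which a new speaker starts.
    entity_spans: (start, end) word-index pairs, end exclusive.
    endpoints_after: word indices after which an utterance ends.
    Token spellings are illustrative assumptions.
    """
    starts = {start for start, _ in entity_spans}
    ends = {end for _, end in entity_spans}
    out = []
    for i, word in enumerate(words):
        if i in speaker_change_at:
            out.append("[SCD]")
        if i in starts:
            out.append("[NE]")
        out.append(word)
        if i + 1 in ends:
            out.append("[/NE]")
        if i in endpoints_after:
            out.append("[ENDP]")
    return " ".join(out)


words = "hello this is john smith hi john".split()
augmented = add_task_tokens(
    words,
    speaker_change_at={5},
    entity_spans=[(3, 5), (6, 7)],
    endpoints_after={4, 6},
)
```

A model trained on such augmented references would emit the task tokens alongside the words, which is what removes the need for separate downstream models in this sketch.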
-
XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
Authors:
Shashi Kumar,
Srikanth Madikeri,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Iuliia Thorbecke,
Petr Motlicek,
Manjunath K E,
Aravind Ganapathiraju
Abstract:
Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.
Submitted 8 October, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
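The attention masking mentioned in the abstract above can be illustrated with one common pattern, chunk-wise masking with a limited left context. This is a minimal sketch under that assumption; it is not necessarily one of the patterns studied in the paper, and the function and parameter names are hypothetical.

```python
def chunked_attention_mask(num_frames, chunk_size, left_chunks):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Each frame sees every frame in its own chunk plus `left_chunks` previous
    chunks, and never a later chunk, so look-ahead is bounded by one chunk.
    """
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size
        row = []
        for j in range(num_frames):
            chunk_j = j // chunk_size
            row.append(chunk_i - left_chunks <= chunk_j <= chunk_i)
        mask.append(row)
    return mask


mask = chunked_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
```

With `chunk_size=2` and `left_chunks=1`, frame 4 can attend to frames 2 to 5 only. Reducing `left_chunks` shortens the left context that has to be kept during streaming.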
-
Implementing contextual biasing in GPU decoder for online ASR
Authors:
Iuliia Nigmatulina,
Srikanth Madikeri,
Esaú Villatoro-Tello,
Petr Motliček,
Juan Zuluaga-Gomez,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
GPU decoding significantly accelerates the output of ASR predictions. While GPUs are already being used for online ASR decoding, post-processing and rescoring on GPUs have not been properly investigated yet. Rescoring with available contextual information can considerably improve ASR predictions. Previous studies have proven the viability of lattice rescoring in decoding and biasing language model (LM) weights in offline and online CPU scenarios. In real-time GPU decoding, partial recognition hypotheses are produced without lattice generation, which makes the implementation of biasing more complex. The paper proposes and describes an approach to integrate contextual biasing in real-time GPU decoding while exploiting the standard Kaldi GPU decoder. Besides the biasing of partial ASR predictions, our approach also permits dynamic context switching allowing a flexible rescoring per each speech segment directly on GPU. The code is publicly released and tested with open-sourced test sets.
Submitted 23 June, 2023;
originally announced June 2023.
-
Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding
Authors:
Juan Zuluaga-Gomez,
Iuliia Nigmatulina,
Amrutha Prasad,
Petr Motlicek,
Driss Khalil,
Srikanth Madikeri,
Allan Tart,
Igor Szoke,
Vincent Lenders,
Mickael Rigault,
Khalid Choukri
Abstract:
Voice communication between air traffic controllers (ATCos) and pilots is critical for ensuring safe and efficient air traffic control (ATC). This task requires high levels of awareness from ATCos and can be tedious and error-prone. Recent attempts have been made to integrate artificial intelligence (AI) into ATC in order to reduce the workload of ATCos. However, the development of data-driven AI systems for ATC demands large-scale annotated datasets, which are currently lacking in the field. This paper explores the lessons learned from the ATCO2 project, which aimed to develop a unique platform to collect and preprocess large amounts of ATC data from airspace in real time. Audio and surveillance data were collected from publicly accessible radio frequency channels with VHF receivers owned by a community of volunteers and later uploaded to Opensky Network servers, which can be considered an "unlimited source" of data. In addition, this paper reviews previous work from ATCO2 partners, including (i) robust automatic speech recognition, (ii) natural language processing, (iii) English language identification of ATC communications, and (iv) the integration of surveillance data such as ADS-B. We believe that the pipeline developed during the ATCO2 project, along with the open-sourcing of its data, will encourage research in the ATC field. A sample of the ATCO2 corpus is available on the following website: https://www.atco2.org/data, while the full corpus can be purchased through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. We demonstrated that ATCO2 is an appropriate dataset for developing ASR engines when little or no in-domain ATC data is available. For instance, with the CNN-TDNNf Kaldi model, we reached WERs as low as 17.9% and 24.9% on public ATC datasets, which is 6.6/7.6% better than an "out-of-domain" but supervised CNN-TDNNf model.
Submitted 1 May, 2023;
originally announced May 2023.
-
Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks
Authors:
Esaú Villatoro-Tello,
Srikanth Madikeri,
Juan Zuluaga-Gomez,
Bidisha Sharma,
Seyyed Saeed Sarfjoo,
Iuliia Nigmatulina,
Petr Motlicek,
Alexei V. Ivanov,
Aravind Ganapathiraju
Abstract:
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word-consensus-networks, allows the SLU system to improve in comparison to the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.
Submitted 17 March, 2023; v1 submitted 16 December, 2022;
originally announced December 2022.
-
A Comparison of Methods for OOV-word Recognition on a New Public Dataset
Authors:
Rudolf A. Braun,
Srikanth Madikeri,
Petr Motlicek
Abstract:
A common problem for automatic speech recognition systems is how to recognize words that they did not see during training. Currently there is no established method of evaluating different techniques for tackling this problem. We propose using the CommonVoice dataset to create test sets for multiple languages which have a high out-of-vocabulary (OOV) ratio relative to a training set and release a new tool for calculating relevant performance metrics. We then evaluate, within the context of a hybrid ASR system, how much better subword models are at recognizing OOVs, and how much benefit one can get from incorporating OOV-word information into an existing system by modifying WFSTs. Additionally, we propose a new method for modifying a subword-based language model so as to better recognize OOV-words. We showcase very large improvements in OOV-word recognition and make both the data and code available.
Submitted 16 July, 2021;
originally announced July 2021.
-
Comparing CTC and LFMMI for out-of-domain adaptation of wav2vec 2.0 acoustic model
Authors:
Apoorv Vyas,
Srikanth Madikeri,
Hervé Bourlard
Abstract:
In this work, we investigate if the wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues with connectionist temporal classification (CTC) training to reduce its performance gap with flat-start lattice-free MMI (E2E-LFMMI) for automatic speech recognition with limited training data. Towards that objective, we use the pretrained wav2vec 2.0 BASE model and fine-tune it on three different datasets including out-of-domain (Switchboard) and cross-lingual (Babel) scenarios. Our results show that for supervised adaptation of the wav2vec 2.0 model, both E2E-LFMMI and CTC achieve similar results; significantly outperforming the baselines trained only with supervised data. Fine-tuning the wav2vec 2.0 model with E2E-LFMMI and CTC we obtain the following relative WER improvements over the supervised baseline trained with E2E-LFMMI. We get relative improvements of 40% and 44% on the clean-set and 64% and 58% on the test set of Librispeech (100h) respectively. On Switchboard (300h) we obtain relative improvements of 33% and 35% respectively. Finally, for Babel languages, we obtain relative improvements of 26% and 23% on Swahili (38h) and 18% and 17% on Tagalog (84h) respectively.
Submitted 6 April, 2021;
originally announced April 2021.
-
Lattice-Free MMI Adaptation Of Self-Supervised Pretrained Acoustic Models
Authors:
Apoorv Vyas,
Srikanth Madikeri,
Hervé Bourlard
Abstract:
In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of a self-supervised pretrained acoustic model. We pretrain a Transformer model on a thousand hours of untranscribed Librispeech data, followed by supervised adaptation with LFMMI on three different datasets. Our results show that, by fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the clean and other test sets of Librispeech (100h), 10.8% on Switchboard (300h), and 4.3% on Swahili (38h) and 4.4% on Tagalog (84h) compared to the baseline trained only with supervised data.
Submitted 6 April, 2021; v1 submitted 28 December, 2020;
originally announced December 2020.
-
Speech Activity Detection Based on Multilingual Speech Recognition System
Authors:
Seyyed Saeed Sarfjoo,
Srikanth Madikeri,
Petr Motlicek
Abstract:
To better model contextual information and increase the generalization ability of a Speech Activity Detection (SAD) system, this paper leverages a multi-lingual Automatic Speech Recognition (ASR) system to perform SAD. Sequence discriminative training of the Acoustic Model (AM) using the Lattice-Free Maximum Mutual Information (LF-MMI) loss function effectively extracts the contextual information of the input acoustic frame. Multi-lingual AM training provides robustness to noise and language variability. The index of the maximum output posterior is considered as a frame-level speech/non-speech decision function. Majority voting and logistic regression are applied to fuse the language-dependent decisions. The multi-lingual ASR is trained on 18 languages of the BABEL datasets and the resulting SAD is evaluated on 3 different languages. On out-of-domain datasets, the proposed SAD model shows significantly better performance than the baseline models. On the Ester2 dataset, without using any in-domain data, this model outperforms the WebRTC, phoneme recognizer based VAD (Phn Rec), and Pyannote baselines (by 7.1, 1.7, and 2.7% absolute, respectively) in the Detection Error Rate (DetER) metric. Similarly, on the LiveATC dataset, this model outperforms the WebRTC, Phn Rec, and Pyannote baselines (by 6.4, 10.0, and 3.7% absolute, respectively) in the DetER metric.
Submitted 11 April, 2021; v1 submitted 23 October, 2020;
originally announced October 2020.
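The majority-voting fusion of language-dependent decisions mentioned in the abstract above can be sketched as follows. The sketch assumes that each language-dependent output has already been reduced to a per-frame speech (1) or non-speech (0) label; the tie-breaking rule and the function name are assumptions, not details from the paper.

```python
def majority_vote(decisions):
    """Fuse per-language frame-level speech (1) / non-speech (0) decisions.

    decisions: list of equal-length lists, one per language-dependent output.
    A frame is labelled speech when more than half of the systems say so;
    ties fall to non-speech (an assumption of this sketch).
    """
    num_systems = len(decisions)
    return [1 if 2 * sum(frame) > num_systems else 0 for frame in zip(*decisions)]


per_language = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
]
fused = majority_vote(per_language)
```

Here the first frame is labelled non-speech because only one of the three hypothetical systems votes for speech.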
-
Novel Architectures for Unsupervised Information Bottleneck based Speaker Diarization of Meetings
Authors:
Nauman Dawalatabad,
Srikanth Madikeri,
C. Chandra Sekhar,
Hema A. Murthy
Abstract:
Speaker diarization is an important problem that is topical, and is especially useful as a preprocessor for conversational speech related applications. The objective of this paper is two-fold: (i) segment initialization by uniformly distributing speaker information across the initial segments, and (ii) incorporating speaker discriminative features within the unsupervised diarization framework. In the first part of the work, a varying length segment initialization technique for Information Bottleneck (IB) based speaker diarization system using phoneme rate as the side information is proposed. This initialization distributes speaker information uniformly across the segments and provides a better starting point for IB based clustering. In the second part of the work, we present a Two-Pass Information Bottleneck (TPIB) based speaker diarization system that incorporates speaker discriminative features during the process of diarization. The TPIB based speaker diarization system has shown improvement over the baseline IB based system. During the first pass of the TPIB system, a coarse segmentation is performed using IB based clustering. The alignments obtained are used to generate speaker discriminative features using a shallow feed-forward neural network and linear discriminant analysis. The discriminative features obtained are used in the second pass to obtain the final speaker boundaries. In the final part of the paper, variable segment initialization is combined with the TPIB framework. This leverages the advantages of better segment initialization and speaker discriminative features that results in an additional improvement in performance. An evaluation on standard meeting datasets shows that a significant absolute improvement of 3.9% and 4.7% is obtained on the NIST and AMI datasets, respectively.
Submitted 13 October, 2020;
originally announced October 2020.
-
Pkwrap: a PyTorch Package for LF-MMI Training of Acoustic Models
Authors:
Srikanth Madikeri,
Sibo Tong,
Juan Zuluaga-Gomez,
Apoorv Vyas,
Petr Motlicek,
Hervé Bourlard
Abstract:
We present a simple wrapper that is useful to train acoustic models in PyTorch using Kaldi's LF-MMI training framework. The wrapper, called pkwrap (short form of PyTorch kaldi wrapper), enables the user to utilize the flexibility provided by PyTorch in designing model architectures. It exposes the LF-MMI cost function as an autograd function. Other capabilities of Kaldi have also been ported to PyTorch. These include parallel training when multi-GPU environments are unavailable and decoding with graphs created in Kaldi. The package is available on GitHub at https://github.com/idiap/pkwrap.
Submitted 7 October, 2020;
originally announced October 2020.
-
Quantization of Acoustic Model Parameters in Automatic Speech Recognition Framework
Authors:
Amrutha Prasad,
Petr Motlicek,
Srikanth Madikeri
Abstract:
State-of-the-art hybrid automatic speech recognition (ASR) systems exploit deep neural network (DNN) based acoustic models (AM) trained with the Lattice-Free Maximum Mutual Information (LF-MMI) criterion and n-gram language models. The AMs typically have millions of parameters and require significant parameter reduction to operate on embedded devices. The impact of parameter quantization on the overall word recognition performance is studied in this paper. The following approaches are presented: (i) AM trained in the Kaldi framework with a conventional factorized TDNN (TDNN-F) architecture, (ii) the TDNN AM built in Kaldi loaded into the PyTorch toolkit using a C++ wrapper for post-training quantization, (iii) quantization-aware training in PyTorch for the Kaldi TDNN model, (iv) quantization-aware training in Kaldi. Results obtained on the standard Librispeech setup provide an interesting overview of recognition accuracy w.r.t. the applied quantization scheme.
Submitted 20 November, 2020; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Graph2Speak: Improving Speaker Identification using Network Knowledge in Criminal Conversational Data
Authors:
Mael Fabien,
Seyyed Saeed Sarfjoo,
Petr Motlicek,
Srikanth Madikeri
Abstract:
Criminal investigations mostly rely on the collection of speech conversational data in order to identify speakers and build or enrich an existing criminal network. Social network analysis tools are then applied to identify the most central characters and the different communities within the network. We introduce two candidate datasets for criminal conversational data, Crime Scene Investigation (CSI), a television show, and the ROXANNE simulated data. We also introduce the metric of conversation accuracy in the context of criminal investigations. By re-ranking candidate speakers based on the frequency of previous interactions, we improve the speaker identification baseline by 1.2% absolute (1.3% relative), and the conversation accuracy by 2.6% absolute (3.4% relative) on CSI data, and by 1.1% absolute (1.2% relative), and 2% absolute (2.5% relative) respectively on the ROXANNE simulated data.
Submitted 21 September, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
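The re-ranking of candidate speakers by the frequency of previous interactions, as described in the abstract above, can be sketched as follows. The sketch assumes that one participant of the conversation is already known, that interaction counts are stored per speaker pair, and that the prior is added to the speaker identification score with a fixed weight. The scoring formula, the `weight` value, and all names are hypothetical.

```python
def rerank(candidates, interaction_counts, known_speaker, weight=0.1):
    """Re-rank (speaker, score) candidates using prior interaction counts.

    interaction_counts maps frozenset({a, b}) to the number of earlier
    conversations between a and b. The additive bonus and `weight` are
    illustrative assumptions, not the paper's formula.
    """
    def count(speaker):
        return interaction_counts.get(frozenset({known_speaker, speaker}), 0)

    total = sum(count(speaker) for speaker, _ in candidates)
    rescored = []
    for speaker, score in candidates:
        prior = count(speaker) / total if total else 0.0
        rescored.append((speaker, score + weight * prior))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


candidates = [("spk_a", 0.62), ("spk_b", 0.60), ("spk_c", 0.40)]
counts = {
    frozenset({"spk_x", "spk_b"}): 8,
    frozenset({"spk_x", "spk_a"}): 2,
}
ranked = rerank(candidates, counts, known_speaker="spk_x")
```

In this toy case `spk_b` overtakes `spk_a` because it has interacted with the known speaker `spk_x` more often, even though its acoustic score is slightly lower.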
-
Incremental Transfer Learning in Two-pass Information Bottleneck based Speaker Diarization System for Meetings
Authors:
Nauman Dawalatabad,
Srikanth Madikeri,
C Chandra Sekhar,
Hema A Murthy
Abstract:
The two-pass information bottleneck (TPIB) based speaker diarization system operates independently on different conversational recordings. The TPIB system does not consider previously learned speaker discriminative information while diarizing new conversations. Hence, the real time factor (RTF) of the TPIB system is high owing to the training time required for the artificial neural network (ANN). This paper attempts to improve the RTF of the TPIB system using an incremental transfer learning approach in which the parameters learned by the ANN from other conversations are updated using the current conversation rather than learned from scratch. This reduces the RTF significantly. The effectiveness of the proposed approach compared to the baseline IB and TPIB systems is demonstrated on the standard NIST and AMI conversational meeting datasets. With a minor degradation in performance, the proposed system shows a significant improvement of 33.07% and 24.45% in RTF with respect to the TPIB system on the NIST RT-04Eval and AMI-1 datasets, respectively.
Submitted 21 February, 2019;
originally announced February 2019.