-
Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering
Authors:
Andres Carofilis,
Pradeep Rangappa,
Srikanth Madikeri,
Shashi Kumar,
Sergio Burdisso,
Jeena Prakash,
Esau Villatoro-Tello,
Petr Motlicek,
Bidisha Sharma,
Kadri Hacioglu,
Shankar Venkatesan,
Saurabh Vyas,
Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, the pipeline outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
Submitted 5 June, 2025;
originally announced June 2025.
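As a rough illustration of what consensus-based pseudo-label filtering can look like, the sketch below keeps a segment only when the hypotheses from several ASR models agree closely. It is not the authors' implementation: the function names, the use of pairwise character error rate as the agreement measure, and the 0.1 threshold are all assumptions made for the example.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def max_pairwise_cer(hyps: list[str]) -> float:
    """Worst disagreement between any two hypotheses, taken in both directions."""
    worst = 0.0
    for a, b in combinations(hyps, 2):
        worst = max(worst, cer(a, b), cer(b, a))
    return worst


def consensus_filter(segments: dict[str, list[str]], threshold: float = 0.1) -> list[str]:
    """Return ids of segments whose model hypotheses agree within the threshold.

    `segments` maps a hypothetical segment id to the transcripts produced by
    each model; the threshold value is an assumption, not a reported setting.
    """
    return [
        seg_id
        for seg_id, hyps in segments.items()
        if max_pairwise_cer(hyps) <= threshold
    ]


segments = {
    "utt1": ["hello world", "hello world", "hello world"],
    "utt2": ["call me back", "tall me pack", "fall the bag"],
}
kept = consensus_filter(segments)
```

Here `kept` contains only `utt1`, since the three hypotheses for `utt2` disagree on more than 10% of characters.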
-
Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering
Authors:
Pradeep Rangappa,
Andres Carofilis,
Jeena Prakash,
Shashi Kumar,
Sergio Burdisso,
Srikanth Madikeri,
Esau Villatoro-Tello,
Bidisha Sharma,
Petr Motlicek,
Kadri Hacioglu,
Shankar Venkatesan,
Saurabh Vyas,
Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
Submitted 4 June, 2025;
originally announced June 2025.
-
LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR
Authors:
Iuliia Thorbecke,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Andres Carofilis,
Shashi Kumar,
Petr Motlicek,
Karthik Pandia,
Aravind Ganapathiraju
Abstract:
Despite the recent success of end-to-end models for automatic speech recognition, recognizing rare and out-of-vocabulary words, as well as fast domain adaptation with text, remains challenging. Biasing toward such special entities often degrades overall performance. We propose a lightweight on-the-fly method to improve automatic speech recognition performance by combining a bias list of named entities with a word-level n-gram language model via shallow fusion based on the Aho-Corasick string matching algorithm. The Aho-Corasick algorithm has proved to be more efficient than other methods and allows fast context adaptation. The n-gram language model is represented as a graph with fail and output arcs, where the arc weights are adapted from the n-gram probabilities. The language model serves as additional support for keyword biasing: it is combined with the bias entities in a single context graph to preserve overall performance. We demonstrate our findings on 4 languages and on 2 public and 1 private datasets, including performance on named entities and out-of-vocabulary entities. We achieve up to 21.6% relative improvement in the general word error rate with no practical difference in the inverse real-time factor.
Submitted 20 September, 2024;
originally announced September 2024.
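For readers unfamiliar with the Aho-Corasick automaton mentioned in the abstract above, the following is a minimal character-level sketch of the classic string-matching algorithm, with goto, fail, and output structures. It only finds bias-list keywords in text; the n-gram arc weights and shallow fusion with a transducer decoder described in the paper are not modeled, and the class and method names are assumptions chosen for illustration.

```python
from collections import deque


class AhoCorasick:
    """Classic Aho-Corasick automaton over characters (illustrative sketch)."""

    def __init__(self, keywords):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # fail[state] -> fallback state on mismatch
        self.out = [[]]    # out[state] -> keywords ending at this state
        for kw in keywords:
            self._insert(kw)
        self._build_failure_links()

    def _insert(self, kw):
        """Add one keyword to the trie."""
        state = 0
        for ch in kw:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(kw)

    def _build_failure_links(self):
        """Breadth-first pass that sets fail links and merges outputs."""
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit keywords that end at the fallback state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Return (start_index, keyword) for every match, in scan order."""
        state = 0
        matches = []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for kw in self.out[state]:
                matches.append((i - len(kw) + 1, kw))
        return matches


automaton = AhoCorasick(["he", "she", "his", "hers"])
hits = automaton.find("ushers")
```

All keywords are matched in a single left-to-right pass over the input, which is why this family of automata suits on-the-fly biasing; in a decoder the symbols would typically be subword tokens rather than characters.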
-
MeWEHV: Mel and Wave Embeddings for Human Voice Tasks
Authors:
Andrés Carofilis,
Laura Fernández-Robles,
Enrique Alegre,
Eduardo Fidalgo
Abstract:
A recent trend in speech processing is the use of embeddings created through machine learning models trained on a specific task with large datasets. By leveraging the knowledge already acquired, these models can be reused in new tasks where the amount of available data is small. This paper proposes a pipeline to create a new model, called Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), capable of generating robust embeddings for speech processing. MeWEHV combines the embeddings generated by a pre-trained raw audio waveform encoder model, and deep features extracted from Mel Frequency Cepstral Coefficients (MFCCs) using Convolutional Neural Networks (CNNs). We evaluate the performance of MeWEHV on three tasks: speaker, language, and accent identification. For the first one, we use the VoxCeleb1 dataset and present YouSpeakers204, a new and publicly available dataset for English speaker identification that contains 19607 audio clips from 204 persons speaking in six different accents, allowing other researchers to work with a very balanced dataset, and to create new models that are robust to multiple accents. For evaluating the language identification task, we use the VoxForge and Common Language datasets. Finally, for accent identification, we use the Latin American Spanish Corpora (LASC) and Common Voice datasets. Our approach allows a significant increase in the performance of state-of-the-art models on all the tested datasets, with a low additional computational cost.
Submitted 24 June, 2023; v1 submitted 28 September, 2022;
originally announced September 2022.
-
Classifying Suspicious Content in Tor Darknet
Authors:
Eduardo Fidalgo Fernandez,
Roberto Andrés Vasco Carofilis,
Francisco Jáñez Martino,
Pablo Blanco Medina
Abstract:
One of the tasks of law enforcement agencies is to find evidence of criminal activity in the Darknet. However, manually visiting thousands of domains to locate visual information containing illegal acts requires a considerable amount of time and resources. Furthermore, the background of the images can pose a challenge when performing classification. To solve this problem, in this paper, we explore the automatic classification of Tor Darknet images using Semantic Attention Keypoint Filtering (SAKF), a strategy that filters out, at the pixel level, non-significant features that do not belong to the object of interest by combining saliency maps with Bag of Visual Words (BoVW). We evaluated SAKF on a custom Tor image dataset against CNN features (MobileNet v1 and ResNet50) and against BoVW using dense SIFT descriptors, achieving 87.98% accuracy and outperforming all other approaches.
Submitted 21 May, 2020; v1 submitted 20 May, 2020;
originally announced May 2020.