-
Unified Semi-Supervised Pipeline for Automatic Speech Recognition
Authors:
Nune Tadevosyan,
Nikolay Karpov,
Andrei Andrusenko,
Vitaly Lavrukhin,
Ante Jukic
Abstract:
Automatic Speech Recognition has been a longstanding research area, with substantial efforts dedicated to integrating semi-supervised learning due to the scarcity of labeled datasets. However, most prior work has focused on improving learning algorithms using existing datasets, without providing a complete public framework for large-scale semi-supervised training across new datasets or languages.…
▽ More
Automatic Speech Recognition has been a longstanding research area, with substantial efforts dedicated to integrating semi-supervised learning due to the scarcity of labeled datasets. However, most prior work has focused on improving learning algorithms using existing datasets, without providing a complete public framework for large-scale semi-supervised training across new datasets or languages. In this work, we introduce a fully open-source semi-supervised training framework encompassing the entire pipeline: from unlabeled data collection to pseudo-labeling and model training. Our approach enables scalable dataset creation for any language using publicly available speech data under Creative Commons licenses. We also propose a novel pseudo-labeling algorithm, TopIPL, and evaluate it in both low-resource (Portuguese, Armenian) and high-resource (Spanish) settings. Notably, TopIPL achieves relative WER improvements of 18-40% for Portuguese, 5-16% for Armenian, and 2-8% for Spanish.
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
Pushing the Limits of Beam Search Decoding for Transducer-based ASR models
Authors:
Lilit Grigoryan,
Vladimir Bataev,
Andrei Andrusenko,
Hainan Xu,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accel…
▽ More
Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++. The proposed method utilizes batch operations, a tree-based hypothesis structure, novel blank scoring for enhanced shallow fusion, and CUDA graph execution for efficient GPU inference. This narrows the speed gap between beam and greedy modes to only 10-20% for the whole system, achieves 14-30% relative improvement in WER compared to greedy decoding, and improves shallow fusion for low-resource up to 11% compared to existing implementations. All the algorithms are open sourced.
△ Less
Submitted 30 May, 2025;
originally announced June 2025.
-
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding
Authors:
Vladimir Bataev,
Andrei Andrusenko,
Lilit Grigoryan,
Aleksandr Laptev,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized in…
▽ More
Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
Granary: Speech Recognition and Translation Dataset in 25 European Languages
Authors:
Nithin Rao Koluguri,
Monica Sekoyan,
George Zelenfroynd,
Sasha Meister,
Shuoyang Ding,
Sofia Kostandian,
He Huang,
Nikolay Karpov,
Jagadeesh Balam,
Vitaly Lavrukhin,
Yifan Peng,
Sara Papi,
Marco Gaido,
Alessio Brutti,
Boris Ginsburg
Abstract:
Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance d…
▽ More
Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amount of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. Dataset will be made available at https://hf.co/datasets/nvidia/Granary
△ Less
Submitted 21 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
Training and Inference Efficiency of Encoder-Decoder Speech Models
Authors:
Piotr Żelasko,
Kunal Dhawan,
Daniel Galvez,
Krishna C. Puvvada,
Ankita Pasad,
Nithin Rao Koluguri,
Ke Hu,
Vitaly Lavrukhin,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
Attention encoder-decoder model architecture is the backbone of several recent top performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask the questions of whether we are training these speech models e…
▽ More
Attention encoder-decoder model architecture is the backbone of several recent top performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask the questions of whether we are training these speech models efficiently, and what can we do to improve? We argue that a major, if not the most severe, detrimental factor for training efficiency is related to the sampling strategy of sequential data. We show that negligence in mini-batch sampling leads to more than 50% computation being spent on padding. To that end, we study, profile, and optimize Canary-1B training to show gradual improvement in GPU utilization leading up to 5x increase in average batch sizes versus its original training settings. This in turn allows us to train an equivalent model using 4x less GPUs in the same wall time, or leverage the original resources and train it in 2x shorter wall time. Finally, we observe that the major inference bottleneck lies in the autoregressive decoder steps. We find that adjusting the model architecture to transfer model parameters from the decoder to the encoder results in a 3x inference speedup as measured by inverse real-time factor (RTFx) while preserving the accuracy and compute requirements for convergence. The training code and models will be available as open-source.
△ Less
Submitted 19 March, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
-
Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages
Authors:
Alexan Ayrapetyan,
Sofia Kostandian,
Ara Yeroyan,
Mher Yerznkanyan,
Nikolay Karpov,
Nune Tadevosyan,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing and various permissive data sources such as audiobooks, Common Voice, YouTube. While these methods are well-explored for highresource languages, their application for low-resource languages remains underexplored. Using Armenian and Geor…
▽ More
This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing and various permissive data sources such as audiobooks, Common Voice, YouTube. While these methods are well-explored for highresource languages, their application for low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance for researchers to choose cost-effective and quality-driven dataset extension strategies for low-resource languages. The key takeaway from various data extension approaches is that paid crowd-sourcing offers the best balance between cost and quality, outperforming volunteer crowd-sourcing, open-source audiobooks, and unlabeled data usage. Ablation study shows that models trained on the expanded datasets outperform existing baselines and achieve 5.73% for Gergian and 9.9% for Armenian ASR word error rate using a relatively small FastConformer architecture. We open-sourced both the Armenian and Georgian models to allow further research and practical applications.
△ Less
Submitted 7 February, 2025; v1 submitted 8 January, 2025;
originally announced January 2025.
-
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer
Authors:
Vladimir Bataev,
Subhankar Ghosh,
Vitaly Lavrukhin,
Jason Li
Abstract:
This work introduces TTS-Transducer - a novel architecture for text-to-speech, leveraging the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments and allow for avoiding using explicit duration predictors. Neural audio codecs efficiently compress audio into discrete…
▽ More
This work introduces TTS-Transducer - a novel architecture for text-to-speech, leveraging the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments and allow for avoiding using explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, revealing the possibility of applying text modeling approaches to speech generation. However, the complexity of predicting multiple tokens per frame from several codebooks, as necessitated by audio codec models with residual quantizers, poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining codes using the alignment extracted from transducer loss. The proposed system is trained end-to-end. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.
△ Less
Submitted 10 January, 2025;
originally announced January 2025.
-
EMMeTT: Efficient Multimodal Machine Translation Training
Authors:
Piotr Żelasko,
Zhehuai Chen,
Mengru Wang,
Daniel Galvez,
Oleksii Hrinchuk,
Shuoyang Ding,
Ke Hu,
Jagadeesh Balam,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
A rising interest in the modality extension of foundation language models warrants discussion on the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only G…
▽ More
A rising interest in the modality extension of foundation language models warrants discussion on the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only GPT and encoder-decoder T5, extended with Canary-1B's speech encoder. To handle joint multimodal training, we propose a novel training framework called EMMeTT. EMMeTT improves training efficiency with the following: balanced sampling across languages, datasets, and modalities; efficient sequential data iteration; and a novel 2D bucketing scheme for multimodal data, complemented by a batch size optimizer (OOMptimizer). We show that a multimodal training consistently helps with both architectures. Moreover, SALM-T5 trained with EMMeTT retains the original NMT capability while outperforming AST baselines on four-language subsets of FLORES and FLEURS. The resultant Multimodal Translation Model produces strong text and speech translation results at the same time.
△ Less
Submitted 20 September, 2024;
originally announced September 2024.
-
Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Authors:
Krishna C. Puvvada,
Piotr Żelasko,
He Huang,
Oleksii Hrinchuk,
Nithin Rao Koluguri,
Kunal Dhawan,
Somshubra Majumdar,
Elena Rastorgueva,
Zhehuai Chen,
Vitaly Lavrukhin,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while b…
▽ More
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while being trained on an order of magnitude less data than these models. Three key factors enables such data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture (2) training on synthetic data generated with machine translation and (3) advanced training techniques: data-balancing, dynamic data blending, dynamic bucketing and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
Authors:
Andrei Andrusenko,
Aleksandr Laptev,
Vladimir Bataev,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and T…
▽ More
Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in corresponding frame intervals. A Hybrid Transducer-CTC model enables the CTC-WS application for the Transducer model. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Label-Looping: Highly Efficient Decoding for Transducers
Authors:
Vladimir Bataev,
Hainan Xu,
Daniel Galvez,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models. We redesign the standard nested-loop design for RNN-T decoding, swapping loops over frames and labels: the outer loop iterates over labels, while the inner loop iterates over frames searching for the next non-blank symbol. Additionally, we represent partial hypotheses in a special str…
▽ More
This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models. We redesign the standard nested-loop design for RNN-T decoding, swapping loops over frames and labels: the outer loop iterates over labels, while the inner loop iterates over frames searching for the next non-blank symbol. Additionally, we represent partial hypotheses in a special structure using CUDA tensors, supporting parallelized hypotheses manipulations. Experiments show that the label-looping algorithm is up to 2.0X faster than conventional batched decoding when using batch size 32. It can be further combined with other compiler or GPU call-related techniques to achieve even more speedup. Our algorithm is general-purpose and can work with both conventional Transducers and Token-and-Duration Transducers. We open-source our implementation to benefit the research community.
△ Less
Submitted 16 September, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio
Authors:
Yang Zhang,
Krishna C. Puvvada,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet based speaker embedding module, a Conformer based masking as well as ASR modules. These modules are jointly optimized to transcribe a target-speaker, while ignoring speech from other speakers. For training…
▽ More
We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet based speaker embedding module, a Conformer based masking as well as ASR modules. These modules are jointly optimized to transcribe a target-speaker, while ignoring speech from other speakers. For training we use Connectionist Temporal Classification (CTC) loss and introduce a scale-invariant spectrogram reconstruction loss to encourage the model better separate the target-speaker's spectrogram from mixture. We obtain state-of-the-art target-speaker word error rate (TS-WER) on WSJ0-2mix-extr (4.2%). Further, we report for the first time TS-WER on WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.2%) and LibriSpeech3Mix (7.6%) datasets, establishing new benchmarks for TS-ASR. The proposed model will be open-sourced through NVIDIA NeMo toolkit.
△ Less
Submitted 9 August, 2023;
originally announced August 2023.
-
Confidence-based Ensembles of End-to-End Speech Recognition Models
Authors:
Igor Gitman,
Vitaly Lavrukhin,
Aleksandr Laptev,
Boris Ginsburg
Abstract:
The number of end-to-end speech recognition models grows every year. These models are often adapted to new domains or languages resulting in a proliferation of expert systems that achieve great results on target data, while generally showing inferior performance outside of their domain of expertise. We explore combination of such experts via confidence-based ensembles: ensembles of models where on…
▽ More
The number of end-to-end speech recognition models grows every year. These models are often adapted to new domains or languages resulting in a proliferation of expert systems that achieve great results on target data, while generally showing inferior performance outside of their domain of expertise. We explore combination of such experts via confidence-based ensembles: ensembles of models where only the output of the most-confident model is used. We assume that models' target data is not available except for a small validation set. We demonstrate effectiveness of our approach with two applications. First, we show that a confidence-based ensemble of 5 monolingual models outperforms a system where model selection is performed via a dedicated language identification block. Second, we demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data. We validate all our results on multiple datasets and model architectures.
△ Less
Submitted 27 June, 2023;
originally announced June 2023.
-
Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator
Authors:
Vladimir Bataev,
Roman Korostik,
Evgeny Shabalin,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram generator with a GAN-based enhancer to improve the spectrogram quality. The proposed syst…
▽ More
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram generator with a GAN-based enhancer to improve the spectrogram quality. The proposed system can generate a mel-spectrogram dynamically during training. It can be used to adapt the ASR model to a new domain by using text-only data from this domain. We demonstrate that the proposed training method significantly improves ASR accuracy compared to the system trained on transcribed speech only. It also surpasses cascade TTS systems with the vocoder in the adaptation quality and training speed.
△ Less
Submitted 16 August, 2023; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition
Authors:
Somshubra Majumdar,
Shantanu Acharya,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
Automatic speech recognition models are often adapted to improve their accuracy in a new domain. A potential drawback of model adaptation to new domains is catastrophic forgetting, where the Word Error Rate on the original domain is significantly degraded. This paper addresses the situation when we want to simultaneously adapt automatic speech recognition models to a new domain and limit the degra…
▽ More
Automatic speech recognition models are often adapted to improve their accuracy in a new domain. A potential drawback of model adaptation to new domains is catastrophic forgetting, where the Word Error Rate on the original domain is significantly degraded. This paper addresses the situation when we want to simultaneously adapt automatic speech recognition models to a new domain and limit the degradation of accuracy on the original domain without access to the original training dataset. We propose several techniques such as a limited training strategy and regularized adapter modules for the Transducer encoder, prediction, and joiner network. We apply these methods to the Google Speech Commands and to the UK and Ireland English Dialect speech data set and obtain strong results on the new target domain while limiting the degradation on the original domain.
△ Less
Submitted 6 October, 2022;
originally announced October 2022.
-
A Toolbox for Construction and Analysis of Speech Datasets
Authors:
Evelina Bakhturina,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets. In this work, we examine common problems with speech data and introduce a toolbox for the construction and interactive error analysis of speech datasets. The construction tool is based on Kürzinger et al. work, and, to the best of our k…
▽ More
Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets. In this work, we examine common problems with speech data and introduce a toolbox for the construction and interactive error analysis of speech datasets. The construction tool is based on Kürzinger et al. work, and, to the best of our knowledge, the dataset exploration tool is the world's first open-source tool of this kind. We demonstrate how to apply these tools to create a Russian speech dataset and analyze existing speech datasets (Multilingual LibriSpeech, Mozilla Common Voice). The tools are open sourced as a part of the NeMo framework.
△ Less
Submitted 6 January, 2022; v1 submitted 10 April, 2021;
originally announced April 2021.
-
SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
Authors:
Patrick K. O'Neill,
Vitaly Lavrukhin,
Somshubra Majumdar,
Vahid Noroozi,
Yuekai Zhang,
Oleksii Kuchaiev,
Jagadeesh Balam,
Yuliya Dovzhenko,
Keenan Freyberg,
Michael D. Shulman,
Boris Ginsburg,
Shinji Watanabe,
Georg Kucsko
Abstract:
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present…
▽ More
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present in the acoustic signal but absent in transcription. Here we propose a new STT task: end-to-end neural transcription with fully formatted text for target labels. We present baseline Conformer-based models trained on a corpus of 5,000 hours of professionally transcribed earnings calls, achieving a CER of 1.7. As a contribution to the STT research community, we release the corpus free for non-commercial use at https://datasets.kensho.com/datasets/scribe.
△ Less
Submitted 6 April, 2021; v1 submitted 5 April, 2021;
originally announced April 2021.
-
Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition
Authors:
Somshubra Majumdar,
Jagadeesh Balam,
Oleksii Hrinchuk,
Vitaly Lavrukhin,
Vahid Noroozi,
Boris Ginsburg
Abstract:
We propose Citrinet - a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is deep residual neural model which uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive and sequence-to-sequen…
▽ More
We propose Citrinet - a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is deep residual neural model which uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive and sequence-to-sequence and transducer models. We evaluate Citrinet on LibriSpeech, TED-LIUM2, AISHELL-1 and Multilingual LibriSpeech (MLS) English speech datasets. Citrinet accuracy on these datasets is close to the best autoregressive Transducer models.
△ Less
Submitted 4 April, 2021;
originally announced April 2021.
-
Hi-Fi Multi-Speaker English TTS Dataset
Authors:
Evelina Bakhturina,
Vitaly Lavrukhin,
Boris Ginsburg,
Yang Zhang
Abstract:
This paper introduces a new multi-speaker English dataset for training text-to-speech models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in the public domain. The new dataset contains about 292 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz. To select speech samples with high quality, we considered audio recordings with a…
▽ More
This paper introduces a new multi-speaker English dataset for training text-to-speech models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in the public domain. The new dataset contains about 292 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz. To select speech samples with high quality, we considered audio recordings with a signal bandwidth of at least 13 kHz and a signal-to-noise ratio (SNR) of at least 32 dB. The dataset is publicly released at http://www.openslr.org/109/ .
△ Less
Submitted 14 June, 2021; v1 submitted 3 April, 2021;
originally announced April 2021.
-
Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition
Authors:
Jagadeesh Balam,
Jocelyn Huang,
Vitaly Lavrukhin,
Slyne Deng,
Somshubra Majumdar,
Boris Ginsburg
Abstract:
We present our experiments in training robust to noise an end-to-end automatic speech recognition (ASR) model using intensive data augmentation. We explore the efficacy of fine-tuning a pre-trained model to improve noise robustness, and we find it to be a very efficient way to train for various noisy conditions, especially when the conditions in which the model will be used, are unknown. Starting…
▽ More
We present our experiments in training robust to noise an end-to-end automatic speech recognition (ASR) model using intensive data augmentation. We explore the efficacy of fine-tuning a pre-trained model to improve noise robustness, and we find it to be a very efficient way to train for various noisy conditions, especially when the conditions in which the model will be used, are unknown. Starting with a model trained on clean data helps establish baseline performance on clean speech. We carefully fine-tune this model to both maintain the performance on clean speech, and improve the model accuracy in noisy conditions. With this schema, we trained robust to noise English and Mandarin ASR models on large public corpora. All described models and training recipes are open sourced in NeMo, a toolkit for conversational AI.
△ Less
Submitted 23 October, 2020;
originally announced October 2020.
-
SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification
Authors:
Nithin Rao Koluguri,
Jason Li,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
We propose SpeakerNet - a new neural architecture for speaker recognition and speaker verification tasks. It is composed of residual blocks with 1D depth-wise separable convolutions, batch-normalization, and ReLU layers. This architecture uses x-vector based statistics pooling layer to map variable-length utterances to a fixed-length embedding (q-vector). SpeakerNet-M is a simple lightweight model…
▽ More
We propose SpeakerNet - a new neural architecture for speaker recognition and speaker verification tasks. It is composed of residual blocks with 1D depth-wise separable convolutions, batch-normalization, and ReLU layers. This architecture uses x-vector based statistics pooling layer to map variable-length utterances to a fixed-length embedding (q-vector). SpeakerNet-M is a simple lightweight model with just 5M parameters. It doesn't use voice activity detection (VAD) and achieves close to state-of-the-art performance scoring an Equal Error Rate (EER) of 2.10% on the VoxCeleb1 cleaned and 2.29% on the VoxCeleb1 trial files.
△ Less
Submitted 23 October, 2020;
originally announced October 2020.
-
Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition
Authors:
Jocelyn Huang,
Oleksii Kuchaiev,
Patrick O'Neill,
Vitaly Lavrukhin,
Jason Li,
Adriana Flores,
Georg Kucsko,
Boris Ginsburg
Abstract:
In this paper, we demonstrate the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks. We start with a pre-trained English ASR model and show that transfer learning can be effectively and easily performed on: (1) different English accents, (2) different languages (German, Spanish and Russian) and (3) application-specific domains. Our experimen…
▽ More
In this paper, we demonstrate the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks. We start with a pre-trained English ASR model and show that transfer learning can be effectively and easily performed on: (1) different English accents, (2) different languages (German, Spanish and Russian) and (3) application-specific domains. Our experiments demonstrate that in all three cases, transfer learning from a good base model has higher accuracy than a model trained from scratch. It is preferred to fine-tune large models than small pre-trained models, even if the dataset for fine-tuning is small. Moreover, transfer learning significantly speeds up convergence for both very small and very large target datasets.
△ Less
Submitted 8 May, 2020;
originally announced May 2020.
-
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
Authors:
Samuel Kriman,
Stanislav Beliaev,
Boris Ginsburg,
Jocelyn Huang,
Oleksii Kuchaiev,
Vitaly Lavrukhin,
Ryan Leary,
Jason Li,
Yang Zhang
Abstract:
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpe…
▽ More
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be effectively fine-tuned on new datasets.
△ Less
Submitted 22 October, 2019;
originally announced October 2019.
-
NeMo: a toolkit for building AI applications using Neural Modules
Authors:
Oleksii Kuchaiev,
Jason Li,
Huyen Nguyen,
Oleksii Hrinchuk,
Ryan Leary,
Boris Ginsburg,
Samuel Kriman,
Stanislav Beliaev,
Vitaly Lavrukhin,
Jack Cook,
Patrice Castonguay,
Mariya Popova,
Jocelyn Huang,
Jonathan M. Cohen
Abstract:
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations…
▽ More
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural type system. The toolkit comes with extendable collections of pre-built modules for automatic speech recognition and natural language processing. Furthermore, NeMo provides built-in support for distributed training and mixed precision on latest NVIDIA GPUs. NeMo is open-source https://github.com/NVIDIA/NeMo
△ Less
Submitted 13 September, 2019;
originally announced September 2019.
-
Jasper: An End-to-End Convolutional Neural Acoustic Model
Authors:
Jason Li,
Vitaly Lavrukhin,
Boris Ginsburg,
Ryan Leary,
Oleksii Kuchaiev,
Jonathan M. Cohen,
Huyen Nguyen,
Ravi Teja Gadde
Abstract:
In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep arc…
▽ More
In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on the Wall Street Journal and the Hub5'00 conversational evaluation datasets.
△ Less
Submitted 26 August, 2019; v1 submitted 5 April, 2019;
originally announced April 2019.
-
Training Neural Speech Recognition Systems with Synthetic Speech Augmentation
Authors:
Jason Li,
Ravi Gadde,
Boris Ginsburg,
Vitaly Lavrukhin
Abstract:
Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such open free datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment a natural speech dataset with synthetic speech. We train very large end-…
▽ More
Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such open free datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment a natural speech dataset with synthetic speech. We train very large end-to-end neural speech recognition models using the LibriSpeech dataset augmented with synthetic speech. These new models achieve state of the art Word Error Rate (WER) for character-level based models without an external language model.
△ Less
Submitted 1 November, 2018;
originally announced November 2018.