Showing 1–26 of 26 results for author: Lavrukhin, V

  1. arXiv:2506.07659  [pdf, ps, other]

    eess.AS

    Unified Semi-Supervised Pipeline for Automatic Speech Recognition

    Authors: Nune Tadevosyan, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Ante Jukic

    Abstract: Automatic Speech Recognition has been a longstanding research area, with substantial efforts dedicated to integrating semi-supervised learning due to the scarcity of labeled datasets. However, most prior work has focused on improving learning algorithms using existing datasets, without providing a complete public framework for large-scale semi-supervised training across new datasets or languages.…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

    ACM Class: I.5.1

  2. arXiv:2506.00185  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Pushing the Limits of Beam Search Decoding for Transducer-based ASR models

    Authors: Lilit Grigoryan, Vladimir Bataev, Andrei Andrusenko, Hainan Xu, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accel…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  3. arXiv:2505.22857  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding

    Authors: Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized in…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  4. arXiv:2505.13404  [pdf, other]

    cs.CL eess.AS

    Granary: Speech Recognition and Translation Dataset in 25 European Languages

    Authors: Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg

    Abstract: Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance d…

    Submitted 21 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025. v2: Added links

  5. arXiv:2503.05931  [pdf, other]

    cs.CL eess.AS

    Training and Inference Efficiency of Encoder-Decoder Speech Models

    Authors: Piotr Żelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

    Abstract: Attention encoder-decoder model architecture is the backbone of several recent top performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask the questions of whether we are training these speech models e…

    Submitted 19 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

  6. arXiv:2501.14788  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages

    Authors: Alexan Ayrapetyan, Sofia Kostandian, Ara Yeroyan, Mher Yerznkanyan, Nikolay Karpov, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing and various permissive data sources such as audiobooks, Common Voice, YouTube. While these methods are well-explored for high-resource languages, their application for low-resource languages remains underexplored. Using Armenian and Geor…

    Submitted 7 February, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

    Comments: The first four authors contributed equally

  7. arXiv:2501.06320  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

    Authors: Vladimir Bataev, Subhankar Ghosh, Vitaly Lavrukhin, Jason Li

    Abstract: This work introduces TTS-Transducer - a novel architecture for text-to-speech, leveraging the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments and allow for avoiding using explicit duration predictors. Neural audio codecs efficiently compress audio into discrete…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: Accepted by ICASSP 2025

  8. arXiv:2409.13523  [pdf, other]

    cs.CL cs.SD eess.AS

    EMMeTT: Efficient Multimodal Machine Translation Training

    Authors: Piotr Żelasko, Zhehuai Chen, Mengru Wang, Daniel Galvez, Oleksii Hrinchuk, Shuoyang Ding, Ke Hu, Jagadeesh Balam, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: A rising interest in the modality extension of foundation language models warrants discussion on the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only G…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: 4 pages, submitted to ICASSP 2025

  9. arXiv:2406.19674  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

    Authors: Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

    Abstract: Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while b…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech-2024

  10. arXiv:2406.07096  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

    Authors: Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and T…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  11. arXiv:2406.06220  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Label-Looping: Highly Efficient Decoding for Transducers

    Authors: Vladimir Bataev, Hainan Xu, Daniel Galvez, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models. We redesign the standard nested-loop design for RNN-T decoding, swapping loops over frames and labels: the outer loop iterates over labels, while the inner loop iterates over frames searching for the next non-blank symbol. Additionally, we represent partial hypotheses in a special str…

    Submitted 16 September, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted at IEEE SLT 2024

  12. Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

    Authors: Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet based speaker embedding module, a Conformer based masking as well as ASR modules. These modules are jointly optimized to transcribe a target-speaker, while ignoring speech from other speakers. For training…

    Submitted 9 August, 2023; originally announced August 2023.

  13. Confidence-based Ensembles of End-to-End Speech Recognition Models

    Authors: Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg

    Abstract: The number of end-to-end speech recognition models grows every year. These models are often adapted to new domains or languages resulting in a proliferation of expert systems that achieve great results on target data, while generally showing inferior performance outside of their domain of expertise. We explore combination of such experts via confidence-based ensembles: ensembles of models where on…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: To appear in Proc. INTERSPEECH 2023, August 20-24, 2023, Dublin, Ireland

  14. arXiv:2302.14036  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

    Authors: Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram generator with a GAN-based enhancer to improve the spectrogram quality. The proposed syst…

    Submitted 16 August, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted to INTERSPEECH 2023

  15. arXiv:2210.03255  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition

    Authors: Somshubra Majumdar, Shantanu Acharya, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Automatic speech recognition models are often adapted to improve their accuracy in a new domain. A potential drawback of model adaptation to new domains is catastrophic forgetting, where the Word Error Rate on the original domain is significantly degraded. This paper addresses the situation when we want to simultaneously adapt automatic speech recognition models to a new domain and limit the degra…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: To appear in Proc. SLT 2022, Jan 09-12, 2023, Doha, Qatar

  16. arXiv:2104.04896  [pdf]

    eess.AS cs.CL cs.SD

    A Toolbox for Construction and Analysis of Speech Datasets

    Authors: Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets. In this work, we examine common problems with speech data and introduce a toolbox for the construction and interactive error analysis of speech datasets. The construction tool is based on Kürzinger et al. work, and, to the best of our k…

    Submitted 6 January, 2022; v1 submitted 10 April, 2021; originally announced April 2021.

  17. arXiv:2104.02014  [pdf, other]

    cs.CL eess.AS

    SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

    Authors: Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko

    Abstract: In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present…

    Submitted 6 April, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: 5 pages, 1 figure. Submitted to INTERSPEECH 2021

  18. arXiv:2104.01721  [pdf, other]

    eess.AS

    Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition

    Authors: Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, Boris Ginsburg

    Abstract: We propose Citrinet - a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is a deep residual neural model which uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive and sequence-to-sequen…

    Submitted 4 April, 2021; originally announced April 2021.

  19. arXiv:2104.01497  [pdf, other]

    eess.AS

    Hi-Fi Multi-Speaker English TTS Dataset

    Authors: Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang

    Abstract: This paper introduces a new multi-speaker English dataset for training text-to-speech models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in the public domain. The new dataset contains about 292 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz. To select speech samples with high quality, we considered audio recordings with a…

    Submitted 14 June, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

  20. arXiv:2010.12715  [pdf, other]

    eess.AS

    Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition

    Authors: Jagadeesh Balam, Jocelyn Huang, Vitaly Lavrukhin, Slyne Deng, Somshubra Majumdar, Boris Ginsburg

    Abstract: We present our experiments in training robust to noise an end-to-end automatic speech recognition (ASR) model using intensive data augmentation. We explore the efficacy of fine-tuning a pre-trained model to improve noise robustness, and we find it to be a very efficient way to train for various noisy conditions, especially when the conditions in which the model will be used, are unknown. Starting…

    Submitted 23 October, 2020; originally announced October 2020.

  21. arXiv:2010.12653  [pdf, other]

    eess.AS

    SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification

    Authors: Nithin Rao Koluguri, Jason Li, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: We propose SpeakerNet - a new neural architecture for speaker recognition and speaker verification tasks. It is composed of residual blocks with 1D depth-wise separable convolutions, batch-normalization, and ReLU layers. This architecture uses x-vector based statistics pooling layer to map variable-length utterances to a fixed-length embedding (q-vector). SpeakerNet-M is a simple lightweight model…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: Preprint, submitted to ICASSP 2021

  22. arXiv:2005.04290  [pdf, other]

    eess.AS

    Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition

    Authors: Jocelyn Huang, Oleksii Kuchaiev, Patrick O'Neill, Vitaly Lavrukhin, Jason Li, Adriana Flores, Georg Kucsko, Boris Ginsburg

    Abstract: In this paper, we demonstrate the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks. We start with a pre-trained English ASR model and show that transfer learning can be effectively and easily performed on: (1) different English accents, (2) different languages (German, Spanish and Russian) and (3) application-specific domains. Our experimen…

    Submitted 8 May, 2020; originally announced May 2020.

  23. arXiv:1910.10261  [pdf, other]

    eess.AS

    QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

    Authors: Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Yang Zhang

    Abstract: We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpe…

    Submitted 22 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  24. arXiv:1909.09577  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    NeMo: a toolkit for building AI applications using Neural Modules

    Authors: Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, Jonathan M. Cohen

    Abstract: NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations…

    Submitted 13 September, 2019; originally announced September 2019.

    Comments: 6 pages plus references

  25. arXiv:1904.03288  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Jasper: An End-to-End Convolutional Neural Acoustic Model

    Authors: Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde

    Abstract: In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep arc…

    Submitted 26 August, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Accepted to INTERSPEECH 2019

  26. arXiv:1811.00707  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

    Authors: Jason Li, Ravi Gadde, Boris Ginsburg, Vitaly Lavrukhin

    Abstract: Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such open free datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment a natural speech dataset with synthetic speech. We train very large end-…

    Submitted 1 November, 2018; originally announced November 2018.

    Comments: Pre-print. Work in progress, 5 pages, 1 figure