Showing 1–27 of 27 results for author: Saon, G

Searching in archive eess.
  1. arXiv:2505.08699  [pdf, other]

    eess.AS

    Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities

    Authors: George Saon, Avihu Dekel, Alexander Brooks, Tohru Nagano, Abraham Daniels, Aharon Satt, Ashish Mittal, Brian Kingsbury, David Haws, Edmilson Morais, Gakuto Kurata, Hagai Aronowitz, Ibrahim Ibrahim, Jeff Kuo, Kate Soule, Luis Lastras, Masayuki Suzuki, Ron Hoory, Samuel Thomas, Sashi Novitasari, Takashi Fukuda, Vishal Sunder, Xiaodong Cui, Zvi Kons

    Abstract: Granite-speech LLMs are compact and efficient speech language models specifically designed for English ASR and automatic speech translation (AST). The models were trained by modality aligning the 2B and 8B parameter variants of granite-3.3-instruct to speech on publicly available open-source corpora containing audio inputs and text targets consisting of either human transcripts for ASR or automati…

    Submitted 13 May, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

    Comments: 7 pages, 9 figures

  2. arXiv:2501.09104  [pdf, other]

    cs.SD cs.AI eess.AS

    A Non-autoregressive Model for Joint STT and TTS

    Authors: Vishal Sunder, Brian Kingsbury, George Saon, Samuel Thomas, Slava Shechtman, Hagai Aronowitz, Eric Fosler-Lussier, Luis Lastras

    Abstract: In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further…

    Submitted 20 January, 2025; v1 submitted 15 January, 2025; originally announced January 2025.

    Comments: 5 pages, 3 figures, 3 tables

  3. arXiv:2402.00235  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring the limits of decoder-only models trained on public speech recognition corpora

    Authors: Ankit Gupta, George Saon, Brian Kingsbury

    Abstract: The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio only proprietary data respectively, has led to a stronger need for large scale public ASR corpora and competitive open source pipelines. Unlike the said models, large language models are typically based on Transformer decoders, and it remains uncl…

    Submitted 31 January, 2024; originally announced February 2024.

  4. arXiv:2309.10926  [pdf, other]

    cs.CL cs.SD eess.AS

    Semi-Autoregressive Streaming ASR With Label Context

    Authors: Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

    Abstract: Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy. Since NAR automatic speech recognition (ASR) models must wait for the completion of the entire utterance before processing, some works explore streaming NAR models based…

    Submitted 20 February, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  5. arXiv:2309.04031  [pdf, other]

    cs.CL cs.SD eess.AS

    Multiple Representation Transfer from Large Language Models to End-to-End ASR Systems

    Authors: Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Masayasu Muraoka, George Saon

    Abstract: Transferring the knowledge of large language models (LLMs) is a promising technique to incorporate linguistic knowledge into end-to-end automatic speech recognition (ASR) systems. However, existing works only transfer a single representation of an LLM (e.g. the last layer of pretrained BERT), while the representation of a text is inherently non-unique and can be obtained variously from different laye…

    Submitted 25 December, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  6. arXiv:2302.14120  [pdf, other]

    eess.AS cs.SD

    Diagonal State Space Augmented Transformers for Speech Recognition

    Authors: George Saon, Ankit Gupta, Xiaodong Cui

    Abstract: We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models. DSS is a recently introduced variant of linear RNNs obtained by discretizing a linear dynamical system with a diagonal state transition matrix. DSS layers project the input sequence onto a space of orthogonal polynomials where the choice of basis functions, metr…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: to be presented at ICASSP 2023

  7. arXiv:2208.01818  [pdf, other]

    cs.SD cs.CL eess.AS

    VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

    Authors: Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

    Abstract: Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses. However, recent studies have shown that decoding with hypothesis merging can achieve a more efficient search with comparable or better performance. But, the full context in recurrent networks is not compatible with hypothesis merging. We propose to use vector-quantized long short-…

    Submitted 2 August, 2022; originally announced August 2022.

    Comments: Interspeech 2022 accepted paper

  8. arXiv:2207.13965  [pdf, other]

    eess.AS cs.SD

    Extending RNN-T-based speech recognition systems with emotion and language classification

    Authors: Zvi Kons, Hagai Aronowitz, Edmilson Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon

    Abstract: Speech transcription, emotion recognition, and language identification are usually considered to be three different tasks. Each one requires a different model with a different architecture and training process. We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification a…

    Submitted 28 July, 2022; originally announced July 2022.

    Comments: Accepted for publication in Interspeech 2022

  9. arXiv:2206.07882  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization

    Authors: Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan

    Abstract: We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model (acoustic encoder and language model) and achieve near-iso-accuracy. We show that customized quantization schemes that are tailo…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: 5 pages, 2 figures, 1 table. Paper accepted to Interspeech 2022

    ACM Class: I.2.6

  10. arXiv:2204.00212  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems

    Authors: Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Nobuyasu Itoh, George Saon

    Abstract: Large-scale language models (LLMs) such as GPT-2, BERT and RoBERTa have been successfully applied to ASR N-best rescoring. However, whether or how they can benefit competitive, near state-of-the-art ASR systems remains unexplored. In this study, we incorporate LLM rescoring into one of the most competitive ASR baselines: the Conformer-Transducer model. We demonstrate that consistent improvement is…

    Submitted 18 August, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

    Comments: Accepted to Interspeech 2022

  11. arXiv:2203.15176  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing

    Authors: Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata

    Abstract: We introduce two techniques, length perturbation and n-best based label smoothing, to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR). Length perturbation is a data augmentation algorithm that randomly drops and inserts frames of an utterance to alter the length of the speech feature sequence. N-best based label smoothing randomly injects…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  12. arXiv:2203.00006  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems

    Authors: Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon

    Abstract: The lack of speech data annotated with labels required for spoken language understanding (SLU) is often a major hurdle in building end-to-end (E2E) systems that can directly process speech inputs. In contrast, large amounts of text data with suitable labels are usually available. In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effect…

    Submitted 26 February, 2022; originally announced March 2022.

    Comments: ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. arXiv admin note: text overlap with arXiv:2202.13155

  13. arXiv:2202.13155  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

    Authors: Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo

    Abstract: Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their all-neural monolithic construction. In this paper, we propose a novel text representation and training framework for E2E ASR models. With this approach, we show tha…

    Submitted 26 February, 2022; originally announced February 2022.

    Comments: ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  14. arXiv:2201.12105  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Improving End-to-End Models for Set Prediction in Spoken Language Understanding

    Authors: Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon

    Abstract: The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts. Advances in end-to-end (E2E) speech modeling have made it possible to train solely on semantic entities, which are far cheaper to collect than verbatim transcripts. We focus on this set prediction problem, where entity…

    Submitted 28 January, 2022; originally announced January 2022.

    Comments: ICASSP ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    ACM Class: I.2.7

  15. arXiv:2110.02743  [pdf, other]

    eess.AS cs.LG cs.NE q-bio.QM

    Towards efficient end-to-end speech recognition with biologically-inspired neural networks

    Authors: Thomas Bohnstingl, Ayush Garg, Stanisław Woźniak, George Saon, Evangelos Eleftheriou, Angeliki Pantazi

    Abstract: Automatic speech recognition (ASR) is a capability which enables a program to process human speech into a written form. Recent developments in artificial intelligence (AI) have led to high-accuracy ASR systems based on deep neural networks, such as the recurrent neural network transducer (RNN-T). However, the core components and the performed operations of these approaches depart from the powerful…

    Submitted 4 November, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

    Comments: Accepted at the Efficient Natural Language and Speech Processing workshop at NeurIPS 2021

  16. arXiv:2108.12074  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    4-bit Quantization of LSTM-based Speech Recognition Models

    Authors: Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan

    Abstract: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM port…

    Submitted 26 August, 2021; originally announced August 2021.

    Comments: 5 pages, 3 figures, Andrea Fasoli and Chia-Yu Chen equally contributed to this work. Paper accepted to Interspeech 2021

    ACM Class: I.2.6

  17. arXiv:2108.10803  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    Reducing Exposure Bias in Training Recurrent Neural Network Transducers

    Authors: Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltan Tuske

    Abstract: When recurrent neural network transducers (RNNTs) are trained using the typical maximum likelihood criterion, the prediction network is trained only on ground truth label sequences. This leads to a mismatch during inference, known as exposure bias, when the model must deal with label sequences containing errors. In this paper we investigate approaches to reducing exposure bias in training to impro…

    Submitted 24 August, 2021; originally announced August 2021.

    Comments: accepted to Interspeech 2021

  18. arXiv:2108.08405  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Dialog History into End-to-End Spoken Language Understanding Systems

    Authors: Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury

    Abstract: End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently. Spoken conversations, on the other hand, are very much context dependent, and dialog history contains useful information that can improve the processing of each conversational turn. In this paper, we inves…

    Submitted 18 August, 2021; originally announced August 2021.

    Comments: Interspeech 2021

  19. arXiv:2105.00982  [pdf, other]

    cs.CL cs.SD eess.AS

    On the limit of English conversational speech recognition

    Authors: Zoltán Tüske, George Saon, Brian Kingsbury

    Abstract: In our previous work we demonstrated that a single headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition. In this paper, we further improve the results for both Switchboard 300 and 2000. Through use of an improved optimizer, speaker vector embeddings, and alternative speech representations we reduce the recognition errors of our LSTM…

    Submitted 3 May, 2021; originally announced May 2021.

  20. arXiv:2104.03842  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    RNN Transducer Models For Spoken Language Understanding

    Authors: Samuel Thomas, Hong-Kwang J. Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory

    Abstract: We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a more restrictive case where transcripts are available…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: To appear in the proceedings of ICASSP 2021

  21. arXiv:2103.09935  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Advancing RNN Transducer Technology for Speech Recognition

    Authors: George Saon, Zoltan Tueske, Daniel Bolanos, Brian Kingsbury

    Abstract: We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks (Switchboard 300 hours, conversational Spanish 780 hours and conversational Italian 900 hours). The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe. First, we introduce a…

    Submitted 17 March, 2021; originally announced March 2021.

    Comments: Accepted at ICASSP 2021

  22. arXiv:2002.10502  [pdf, other]

    cs.DC cs.LG cs.SD eess.AS

    Distributed Training of Deep Neural Network Acoustic Models for Automatic Speech Recognition

    Authors: Xiaodong Cui, Wei Zhang, Ulrich Finkler, George Saon, Michael Picheny, David Kung

    Abstract: The past decade has witnessed great progress in Automatic Speech Recognition (ASR) due to advances in deep learning. The improvements in performance can be attributed to both improved models and large-scale training data. Key to training such models is the employment of efficient distributed learning techniques. In this article, we provide an overview of distributed training techniques for deep ne…

    Submitted 24 February, 2020; originally announced February 2020.

    Comments: Accepted to IEEE Signal Processing Magazine

  23. arXiv:2001.07263  [pdf, other]

    eess.AS cs.CL

    Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

    Authors: Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

    Abstract: It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-u…

    Submitted 19 October, 2020; v1 submitted 20 January, 2020; originally announced January 2020.

    Comments: 5 pages, 2 figures

    MSC Class: 68T10 ACM Class: I.2.7

  24. arXiv:1908.03455  [pdf, other]

    cs.CL cs.SD eess.AS

    Challenging the Boundaries of Speech Recognition: The MALACH Corpus

    Authors: Michael Picheny, Zóltan Tüske, Brian Kingsbury, Kartik Audhkhasi, Xiaodong Cui, George Saon

    Abstract: There has been huge progress in speech recognition over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-Hour subset of a large archive of Holocaust testimonies collected by the Survivors of the Shoah Visual History Foundation, presents significant challenges to the speec…

    Submitted 9 August, 2019; originally announced August 2019.

    Comments: Accepted for publication at INTERSPEECH 2019

  25. arXiv:1907.05701  [pdf, other]

    eess.AS cs.DC cs.LG cs.SD stat.ML

    A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

    Authors: Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny

    Abstract: Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms can converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with much larger batch size than com…

    Submitted 10 July, 2019; originally announced July 2019.

    Journal ref: INTERSPEECH 2019

  26. arXiv:1904.13258  [pdf, other]

    cs.CL cs.SD eess.AS

    English Broadcast News Speech Recognition by Humans and Machines

    Authors: Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, Bern Samko

    Abstract: With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition. In this paper we evaluate the usefulness of these proposed techniques on broadcast news (BN), a similarly challenging task. We also perform a set of recognition measurements to un…

    Submitted 30 April, 2019; originally announced April 2019.

    Comments: ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  27. arXiv:1904.04956  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS stat.ML

    Distributed Deep Learning Strategies For Automatic Speech Recognition

    Authors: Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny

    Abstract: In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000), which is one of the most widely used datasets for ASR performance benchmarking. We first investigate what are the proper hyper-parameters (e.g.,…

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: Published in ICASSP'19