Showing 1–13 of 13 results for author: Variani, E

  1. arXiv:2304.13134  [pdf, ps, other]

    cs.CL

    LAST: Scalable Lattice-Based Speech Modelling in JAX

    Authors: Ke Wu, Ehsan Variani, Tom Bagby, Michael Riley

    Abstract: We introduce LAST, a LAttice-based Speech Transducer library in JAX. With an emphasis on flexibility, ease of use, and scalability, LAST implements differentiable weighted finite state automaton (WFSA) algorithms needed for training & inference that scale to a large WFSA such as a recognition lattice over the entire utterance. Despite these WFSA algorithms being well known in the literature, new…

    Submitted 25 April, 2023; originally announced April 2023.

  2. arXiv:2302.08583  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

    Authors: Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang, Bo Li, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM during E2E training which improves rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of E2…

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: 5 pages, 3 figures, in ICASSP 2023

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece

  3. arXiv:2212.12442  [pdf, ps, other]

    cs.CL cs.LG

    Alignment Entropy Regularization

    Authors: Ehsan Variani, Ke Wu, David Rybach, Cyril Allauzen, Michael Riley

    Abstract: Existing training criteria in automatic speech recognition (ASR) permit the model to freely explore more than one time alignment between the feature and label sequences. In this paper, we use entropy to measure a model's uncertainty, i.e., how it chooses to distribute the probability mass over the set of allowed alignments. Furthermore, we evaluate the effect of entropy regularization in encouragin…

    Submitted 22 December, 2022; originally announced December 2022.

  4. arXiv:2210.17049  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Modular Hybrid Autoregressive Transducer

    Authors: Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno

    Abstract: Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a…

    Submitted 16 February, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

    Comments: 8 pages, 1 figure, in SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar

  5. arXiv:2207.00706  [pdf, other]

    eess.AS cs.CL cs.LG

    UserLibri: A Dataset for ASR Personalization Using Only Text

    Authors: Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey

    Abstract: Personalization of speech models on mobile devices (on-device personalization) is an active area of research, but more often than not, mobile devices have more text-only data than paired audio-text data. We explore training a personalized language model on text-only data, used during inference to improve speech recognition performance for that user. We experiment on a user-clustered LibriSpeech co…

    Submitted 1 July, 2022; originally announced July 2022.

    Comments: Accepted for publication in Interspeech 2022. 9 total pages with appendix, 9 total tables, 5 total figures

  6. arXiv:2205.13674  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Global Normalization for Streaming Speech Recognition in a Modular Framework

    Authors: Ehsan Variani, Ke Wu, Michael Riley, David Rybach, Matt Shannon, Cyril Allauzen

    Abstract: We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming an…

    Submitted 26 May, 2022; originally announced May 2022.

  7. arXiv:2204.07553  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Rare Word Recognition with LM-aware MWER Training

    Authors: Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

    Abstract: Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use…

    Submitted 27 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: To appear in INTERSPEECH 2022

  8. arXiv:2010.14606  [pdf, other]

    eess.AS cs.CL cs.SD

    Cascaded encoders for unifying streaming and non-streaming ASR

    Authors: Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoder…

    Submitted 27 October, 2020; originally announced October 2020.

  9. arXiv:2003.07705  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Hybrid Autoregressive Transducer (hat)

    Authors: Ehsan Variani, David Rybach, Cyril Allauzen, Michael Riley

    Abstract: This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model that can be used to decide whether inference with an external language model is beneficial or not. This artic…

    Submitted 12 March, 2020; originally announced March 2020.

  10. arXiv:2002.11268  [pdf, other]

    eess.AS cs.CL cs.SD

    A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition

    Authors: Erik McDermott, Hasim Sak, Ehsan Variani

    Abstract: This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR). Applied to a Recurrent Neural Network Transducer (RNN-T) ASR model trained on a given domain, a matched in-domain RNN-LM, and a target domain RNN-LM, the proposed method uses Bayes' Rule to define RNN-T posteriors for the target domain, in a m…

    Submitted 27 February, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

    Comments: 8 pages, 4 figures, presented at 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)

  11. arXiv:1811.08417  [pdf, other]

    cs.LG cs.CL stat.ML

    WEST: Word Encoded Sequence Transducers

    Authors: Ehsan Variani, Ananda Theertha Suresh, Mitchel Weintraub

    Abstract: Most of the parameters in large vocabulary models are used in the embedding layer to map categorical features to vectors and in the softmax layer for classification weights. This is a bottleneck in memory-constrained on-device training applications like federated learning and on-device inference applications like automatic speech recognition (ASR). One way of compressing the embedding and softmax layers i…

    Submitted 20 November, 2018; originally announced November 2018.

    Comments: 12 pages

  12. arXiv:1712.03439  [pdf, other]

    cs.SD eess.AS eess.SP

    Efficient Implementation of the Room Simulator for Training Deep Neural Network Acoustic Models

    Authors: Chanwoo Kim, Ehsan Variani, Arun Narayanan, Michiel Bacchiani

    Abstract: In this paper, we describe how to efficiently implement an acoustic room simulator to generate large-scale simulated data for training deep neural networks. Even though Google Room Simulator in [1] was shown to be quite effective in reducing the Word Error Rates (WERs) for far-field applications by generating simulated far-field training sets, it requires a very large number of Fast Fourier Transf…

    Submitted 31 December, 2018; v1 submitted 9 December, 2017; originally announced December 2017.

    Comments: Published at INTERSPEECH 2018. (https://www.isca-speech.org/archive/Interspeech_2018/abstracts/2566.html)

  13. arXiv:1504.05996  [pdf, other]

    cs.IT cs.AI

    Non-Adaptive Policies for 20 Questions Target Localization

    Authors: Ehsan Variani, Kamel Lahouel, Avner Bar-Hen, Bruno Jedynak

    Abstract: The problem of target localization with noise is addressed. The target is a sample from a continuous random variable with known distribution and the goal is to locate it with minimum mean squared error distortion. The localization scheme or policy proceeds by queries, or questions, whether or not the target belongs to some subset, as is done in the 20-question framework. These subsets are n…

    Submitted 1 May, 2015; v1 submitted 22 April, 2015; originally announced April 2015.