-
A Deliberation-based Joint Acoustic and Text Decoder
Authors:
Sepand Mavandadi,
Tara N. Sainath,
Ke Hu,
Zelin Wu
Abstract:
We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training, and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD, combines the spelling-correcting abilities of deliberation with JATD's use of unpaired text data to further improve performance. The proposed model produces substantial gains across multiple test sets, especially those focused on rare words, where it reduces word error rate (WER) by between 12% and 22.5% relative. This is done without increasing model size or requiring multi-stage training, making Deliberation-JATD an efficient candidate for on-device applications.
Submitted 23 March, 2023;
originally announced March 2023.
-
Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Authors:
Chao Zhang,
Bo Li,
Tara Sainath,
Trevor Strohman,
Sepand Mavandadi,
Shuo-yiin Chang,
Parisa Haghani
Abstract:
Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging such differences in the right-contexts and a streaming implementation of statistics pooling, the proposed method can achieve accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average of 96.2% LID prediction accuracy and the same second-pass WER as that obtained by including oracle LID in the input.
Submitted 13 September, 2022;
originally announced September 2022.
-
Improving Deliberation by Text-Only and Semi-Supervised Training
Authors:
Ke Hu,
Tara N. Sainath,
Yanzhang He,
Rohit Prabhavalkar,
Trevor Strohman,
Sepand Mavandadi,
Weiran Wang
Abstract:
Text-only and semi-supervised training based on audio-only data have gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By incorporating text-only data in training a bidirectional encoder representations from transformers (BERT) model for the deliberation text encoder, and large-scale text-to-speech and audio-only utterances using a joint acoustic and text decoder (JATD) and semi-supervised training, we achieved 4%-12% WER reduction for various tasks compared to the baseline deliberation. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the Google Voice Search WER by 11% relative. We show that the deliberation model also achieves a positive human side-by-side evaluation compared to the state-of-the-art LM rescorer with reasonable endpointer latencies.
Submitted 29 June, 2022;
originally announced June 2022.
-
Improving Rare Word Recognition with LM-aware MWER Training
Authors:
Weiran Wang,
Tongzhou Chen,
Tara N. Sainath,
Ehsan Variani,
Rohit Prabhavalkar,
Ronny Huang,
Bhuvana Ramabhadran,
Neeraj Gaur,
Sepand Mavandadi,
Cal Peyser,
Trevor Strohman,
Yanzhang He,
David Rybach
Abstract:
Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use of LMs. For the shallow fusion setup, we use LMs during both hypothesis generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement over the model trained with standard MWER on voice search test sets containing rare words. For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner. This model achieves the same rescoring WER as a regular MWER-trained model, but without the need for sweeping fusion weights.
Submitted 27 June, 2022; v1 submitted 15 April, 2022;
originally announced April 2022.
-
Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus
Authors:
Cal Peyser,
Sepand Mavandadi,
Tara N. Sainath,
James Apfel,
Ruoming Pang,
Shankar Kumar
Abstract:
End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time, it has not yet been explored for very large text corpora, and it has been shown to be very sensitive to hyperparameter settings in the beam search. In this work, we apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model. We explore the impact of model size and show that intelligent pruning of the training set can be more effective than increasing the parameter count. Additionally, we show that incorporating the LM in minimum word error rate (MWER) fine-tuning makes shallow fusion far less dependent on optimal hyperparameter settings, reducing the difficulty of that tuning problem.
Submitted 25 August, 2020; v1 submitted 24 August, 2020;
originally announced August 2020.
-
BigFoot: Analysis, monitoring, tracking and sharing of bio-medical features of human appendages using consumer-grade home and office based imaging devices
Authors:
Sam Mavandadi,
Steve Feng,
Frank Yu,
Richard Yu,
Aydogan Ozcan
Abstract:
Here we describe a system for personal and professional management and analysis of bio-medical images captured using off-the-shelf, consumer-grade imaging devices such as scanners, digital cameras, cellphones, webcams and tablet PCs. Specifically, we describe an implementation of this system for the analysis, monitoring and tracking of conditions and features of human feet using a flatbed scanner as the image capture device and a custom-designed set of algorithms and software to manage and analyze the acquired data.
Submitted 5 December, 2012;
originally announced December 2012.