-
Style-agnostic evaluation of ASR using multiple reference transcripts
Authors:
Quinten McNamara,
Miguel Ángel del Río Fernández,
Nishchal Bhandari,
Martin Ratajczak,
Danny Chen,
Corey Miller,
Migüel Jetté
Abstract:
Word error rate (WER) as a metric has a variety of limitations that have plagued the field of speech recognition. Evaluation datasets suffer from varying style, formality, and inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style p…
▽ More
Word error rate (WER) as a metric has a variety of limitations that have plagued the field of speech recognition. Evaluation datasets suffer from varying style, formality, and inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style parameters. As a result, we find that existing WER reports are likely significantly over-estimating the number of contentful errors made by state-of-the-art ASR systems. In addition, we have found our multireference method to be a useful mechanism for comparing the quality of ASR models that differ in the stylistic makeup of their training data and target task.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
Reverb: Open-Source ASR and Diarization from Rev
Authors:
Nishchal Bhandari,
Danny Chen,
Miguel Ángel del Río Fernández,
Natalie Delworth,
Jennifer Drexler Fox,
Migüel Jetté,
Quinten McNamara,
Corey Miller,
Ondřej Novotný,
Ján Profant,
Nan Qin,
Martin Ratajczak,
Jean-Philippe Robichaud
Abstract:
Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers as well as pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all exi…
▽ More
Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers as well as pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all existing open source speech recognition models across a variety of long-form speech recognition domains.
△ Less
Submitted 24 February, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Quantification of stylistic differences in human- and ASR-produced transcripts of African American English
Authors:
Annika Heuser,
Tyler Kendall,
Miguel del Rio,
Quinten McNamara,
Nishchal Bhandari,
Corey Miller,
Migüel Jetté
Abstract:
Common measures of accuracy used to assess the performance of automatic speech recognition (ASR) systems, as well as human transcribers, conflate multiple sources of error. Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation when differences exist between training and test datasets. The problem is compounded for speech from underrepres…
▽ More
Common measures of accuracy used to assess the performance of automatic speech recognition (ASR) systems, as well as human transcribers, conflate multiple sources of error. Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation when differences exist between training and test datasets. The problem is compounded for speech from underrepresented varieties, where the speech to orthography mapping is not as standardized. We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English (AAE) speech. Focusing on verbatim features and AAE morphosyntactic features, we investigate the interactions of these categories with how well transcripts can be compared via word error rate (WER). The results, and overall analysis, help clarify how ASR outputs are a function of the decisions made by the training data's human transcribers.
△ Less
Submitted 4 September, 2024;
originally announced September 2024.
-
Updated Corpora and Benchmarks for Long-Form Speech Recognition
Authors:
Jennifer Drexler Fox,
Desh Raj,
Natalie Delworth,
Quinn McNamara,
Corey Miller,
Migüel Jetté
Abstract:
The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-word ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, Gigapeech, and VoxPopuli-en -…
▽ More
The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-word ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, Gigapeech, and VoxPopuli-en - with updated transcription and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training for these models, showing its efficacy for model robustness under this domain shift.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
Earnings-21: A Practical Benchmark for ASR in the Wild
Authors:
Miguel Del Rio,
Natalie Delworth,
Ryan Westerman,
Michelle Huang,
Nishchal Bhandari,
Joseph Palakapilly,
Quinten McNamara,
Joshua Dong,
Piotr Zelasko,
Miguel Jette
Abstract:
Commonly used speech corpora inadequately challenge academic and commercial ASR systems. In particular, speech corpora lack metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild with special a…
▽ More
Commonly used speech corpora inadequately challenge academic and commercial ASR systems. In particular, speech corpora lack metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild with special attention towards named entity recognition. We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model and discuss their differences in performance on Earnings-21. Using our recently released fstalign tool, we provide a candid analysis of each model's recognition capabilities under different partitions. Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage. Earnings-21 bridges academic and commercial ASR system evaluation and enables further research on entity modeling and WER on real world audio.
△ Less
Submitted 15 June, 2021; v1 submitted 22 April, 2021;
originally announced April 2021.
-
Accented Speech Recognition: A Survey
Authors:
Arthur Hinsvark,
Natalie Delworth,
Miguel Del Rio,
Quinten McNamara,
Joshua Dong,
Ryan Westerman,
Michelle Huang,
Joseph Palakapilly,
Jennifer Drexler,
Ilya Pirkin,
Nishchal Bhandari,
Miguel Jette
Abstract:
Automatic Speech Recognition (ASR) systems generalize poorly on accented speech. The phonetic and linguistic variability of accents present hard challenges for ASR systems today in both data collection and modeling strategies. The resulting bias in ASR performance across accents comes at a cost to both users and providers of ASR.
We present a survey of current promising approaches to accented sp…
▽ More
Automatic Speech Recognition (ASR) systems generalize poorly on accented speech. The phonetic and linguistic variability of accents present hard challenges for ASR systems today in both data collection and modeling strategies. The resulting bias in ASR performance across accents comes at a cost to both users and providers of ASR.
We present a survey of current promising approaches to accented speech recognition and highlight the key challenges in the space. Approaches mostly focus on single model generalization and accent feature engineering. Among the challenges, lack of a standard benchmark makes research and comparison especially difficult.
△ Less
Submitted 2 June, 2021; v1 submitted 21 April, 2021;
originally announced April 2021.