-
Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR
Authors:
Martin Ratajczak,
Jean-Philippe Robichaud,
Jennifer Drexler Fox
Abstract:
Long-form speech recognition is an application area of increasing research focus. ASR models based on multi-head attention (MHA) are ill-suited to long-form ASR because of their quadratic complexity in sequence length. We build on recent work that has investigated linear complexity recurrent attention (RA) layers for ASR. We find that bidirectional RA layers can match the accuracy of MHA for both…
▽ More
Long-form speech recognition is an application area of increasing research focus. ASR models based on multi-head attention (MHA) are ill-suited to long-form ASR because of their quadratic complexity in sequence length. We build on recent work that has investigated linear complexity recurrent attention (RA) layers for ASR. We find that bidirectional RA layers can match the accuracy of MHA for both short- and long-form applications. We present a strong limited-context attention (LCA) baseline, and show that RA layers are just as accurate while being more efficient. We develop a long-form training paradigm which further improves RA performance, leading to better accuracy than LCA with 44% higher throughput. We also present Direction Dropout, a novel regularization method that improves accuracy, provides fine-grained control of the accuracy/throughput trade-off of bidirectional RA, and enables a new alternating directions decoding mode with even higher throughput.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
Style-agnostic evaluation of ASR using multiple reference transcripts
Authors:
Quinten McNamara,
Miguel Ángel del Río Fernández,
Nishchal Bhandari,
Martin Ratajczak,
Danny Chen,
Corey Miller,
Migüel Jetté
Abstract:
Word error rate (WER) as a metric has a variety of limitations that have plagued the field of speech recognition. Evaluation datasets suffer from varying style, formality, and inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style p…
▽ More
Word error rate (WER) as a metric has a variety of limitations that have plagued the field of speech recognition. Evaluation datasets suffer from varying style, formality, and inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style parameters. As a result, we find that existing WER reports are likely significantly over-estimating the number of contentful errors made by state-of-the-art ASR systems. In addition, we have found our multireference method to be a useful mechanism for comparing the quality of ASR models that differ in the stylistic makeup of their training data and target task.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
Reverb: Open-Source ASR and Diarization from Rev
Authors:
Nishchal Bhandari,
Danny Chen,
Miguel Ángel del Río Fernández,
Natalie Delworth,
Jennifer Drexler Fox,
Migüel Jetté,
Quinten McNamara,
Corey Miller,
Ondřej Novotný,
Ján Profant,
Nan Qin,
Martin Ratajczak,
Jean-Philippe Robichaud
Abstract:
Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers as well as pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all exi…
▽ More
Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers as well as pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all existing open source speech recognition models across a variety of long-form speech recognition domains.
△ Less
Submitted 24 February, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Sum-Product Networks for Sequence Labeling
Authors:
Martin Ratajczak,
Sebastian Tschiatschek,
Franz Pernkopf
Abstract:
We consider higher-order linear-chain conditional random fields (HO-LC-CRFs) for sequence modelling, and use sum-product networks (SPNs) for representing higher-order input- and output-dependent factors. SPNs are a recently introduced class of deep models for which exact and efficient inference can be performed. By combining HO-LC-CRFs with SPNs, expressive models over both the output labels and t…
▽ More
We consider higher-order linear-chain conditional random fields (HO-LC-CRFs) for sequence modelling, and use sum-product networks (SPNs) for representing higher-order input- and output-dependent factors. SPNs are a recently introduced class of deep models for which exact and efficient inference can be performed. By combining HO-LC-CRFs with SPNs, expressive models over both the output labels and the hidden variables are instantiated while still enabling efficient exact inference. Furthermore, the use of higher-order factors allows us to capture relations of multiple input segments and multiple output labels as often present in real-world data. These relations can not be modelled by the commonly used first-order models and higher-order models with local factors including only a single output label. We demonstrate the effectiveness of our proposed models for sequence labeling. In extensive experiments, we outperform other state-of-the-art methods in optical character recognition and achieve competitive results in phone classification.
△ Less
Submitted 6 July, 2018;
originally announced July 2018.