Showing 1–9 of 9 results for author: van Dalen, R

  1. arXiv:2506.10653  [pdf, ps, other]

    eess.AS cs.CL cs.LG

    Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

    Authors: Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya

    Abstract: Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation…

    Submitted 12 June, 2025; originally announced June 2025.

    Journal ref: Interspeech 2025

  2. arXiv:2505.22251  [pdf, ps, other]

    eess.AS cs.CL

    Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition

    Authors: Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

    Abstract: Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of fin…

    Submitted 5 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

  3. arXiv:2505.21578  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use

    Authors: Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier van Dalen

    Abstract: Automatic speech recognition (ASR) research is driven by the availability of common datasets between industrial researchers and academics, encouraging comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, Gigaspeech, OWSM…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  4. arXiv:2501.06051  [pdf, ps, other]

    cs.CL cs.AI eess.AS

    Benchmarking Rotary Position Embeddings for Automatic Speech Recognition

    Authors: Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya

    Abstract: Self-attention relies on positional embeddings to encode input order. Relative Position (RelPos) embeddings are widely used in Automatic Speech Recognition (ASR). However, RelPos has quadratic time complexity in the input length and is often incompatible with fast GPU implementations of attention. In contrast, Rotary Positional Embedding (RoPE) rotates each input vector based on its absolute position,…

    Submitted 15 June, 2025; v1 submitted 10 January, 2025; originally announced January 2025.

  5. arXiv:2409.07165  [pdf, other]

    cs.SD cs.AI eess.AS

    Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

    Authors: Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

    Abstract: Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming spe…

    Submitted 11 September, 2024; originally announced September 2024.

  6. arXiv:2407.13377  [pdf, other]

    cs.CL cs.AI eess.AS

    Linear-Complexity Self-Supervised Learning for Speech Processing

    Authors: Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya

    Abstract: Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the S…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Interspeech 2024

  7. arXiv:2307.10975  [pdf, other]

    eess.AS cs.LG cs.SD

    Globally Normalising the Transducer for Streaming Speech Recognition

    Authors: Rogier van Dalen

    Abstract: The Transducer (e.g. RNN-Transducer or Conformer-Transducer) generates an output label sequence as it traverses the input sequence. It is straightforward to use in streaming mode, where it generates partial hypotheses before the complete input has been seen. This makes it popular in speech recognition. However, in streaming mode the Transducer has a mathematical flaw which, simply put, restricts t…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: 9 pages plus references and appendices

    MSC Class: 68T10

  8. arXiv:2307.07421  [pdf, other]

    cs.CL cs.SD eess.AS

    SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

    Authors: Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

    Abstract: Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes…

    Submitted 11 July, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: Interspeech 2024

  9. arXiv:2008.02651  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Improving on-device speaker verification using federated learning with privacy

    Authors: Filip Granqvist, Matt Seigel, Rogier van Dalen, Áine Cahill, Stephen Shum, Matthias Paulik

    Abstract: Information on speaker characteristics can be useful as side information in improving speaker recognition accuracy. However, such information is often private. This paper investigates how privacy-preserving learning can improve a speaker verification system, by enabling the use of privacy-sensitive speaker data to train an auxiliary classification model that predicts vocal characteristics of speak…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: To appear in proceedings of INTERSPEECH 2020