Search | arXiv e-print repository

Unified Learnable 2D Convolutional Feature Extraction for ASR

Authors: Peter Vieting, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

Abstract: Neural front-ends represent a promising approach to feature extraction for automatic speech recognition (ASR) systems as they enable to learn specifically tailored features for different tasks. Yet, many of the existing techniques remain heavily influenced by classical methods. While this inductive bias may ease the system design, our work aims to develop a more generic front-end for feature extra… ▽ More Neural front-ends represent a promising approach to feature extraction for automatic speech recognition (ASR) systems as they enable to learn specifically tailored features for different tasks. Yet, many of the existing techniques remain heavily influenced by classical methods. While this inductive bias may ease the system design, our work aims to develop a more generic front-end for feature extraction. Furthermore, we seek to unify the front-end architecture contrasting with existing approaches that apply a composition of several layer topologies originating from different sources. The experiments systematically show how to reduce the influence of existing techniques to achieve a generic front-end. The resulting 2D convolutional front-end is parameter-efficient and suitable for a scenario with limited computational resources unlike large models pre-trained on unlabeled audio. The results demonstrate that this generic unified approach is not only feasible but also matches the performance of existing supervised learnable feature extractors. △ Less

Submitted 12 September, 2025; originally announced September 2025.

Comments: Accepted at ITG Conference on Speech Communication 2025

arXiv:2506.09804 [pdf, ps, other]

doi 10.21437/Interspeech.2025-694

Regularizing Learnable Feature Extraction for Automatic Speech Recognition

Authors: Peter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

Abstract: Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regul… ▽ More Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short time Fourier transform (STFT)-domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features. △ Less

Submitted 30 September, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

Comments: Proceedings of Interspeech 2025

arXiv:2506.01503 [pdf, ps, other]

Analyzing the Importance of Blank for CTC-Based Knowledge Distillation

Authors: Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter

Abstract: With the rise of large pre-trained foundation models for automatic speech recognition new challenges appear. While the performance of these models is good, runtime and cost of inference increases. One approach to make use of their strength while retaining efficiency is to distill their knowledge to smaller models during training. In this work, we explore different CTC-based distillation variants,… ▽ More With the rise of large pre-trained foundation models for automatic speech recognition new challenges appear. While the performance of these models is good, runtime and cost of inference increases. One approach to make use of their strength while retaining efficiency is to distill their knowledge to smaller models during training. In this work, we explore different CTC-based distillation variants, focusing on blank token handling. We show that common approaches like blank elimination do not always work off the shelf. We explore new blank selection patterns as a potential sweet spot between standard knowledge distillation and blank elimination mechanisms. Through the introduction of a symmetric selection method, we are able to remove the CTC loss during knowledge distillation with minimal to no performance degradation. With this, we make the training independent from target labels, potentially allowing for distillation on untranscribed audio data. △ Less

Submitted 2 June, 2025; originally announced June 2025.

Comments: Accepted for Interspeech 2025

arXiv:2407.17997 [pdf, ps, other]

doi 10.21437/SynData4GenAI.2024-10

On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures

Authors: Benedikt Hilmes, Nick Rossenbach, and Ralf Schlüter

Abstract: In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). We use the ASR training data to train a text-to-speech (TTS) system similar to FastSpeech-2. With this TTS we reproduce the original training data, training ASR systems solely on synthetic data. For ASR, we use three different architectures, attention-based encoder-decoder, hybrid deep neural ne… ▽ More In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). We use the ASR training data to train a text-to-speech (TTS) system similar to FastSpeech-2. With this TTS we reproduce the original training data, training ASR systems solely on synthetic data. For ASR, we use three different architectures, attention-based encoder-decoder, hybrid deep neural network hidden Markov model and a Gaussian mixture hidden Markov model, showing the different sensitivity of the models to synthetic data generation. In order to extend previous work, we present a number of ablation studies on the effectiveness of synthetic vs. real training data for ASR. In particular we focus on how the gap between training on synthetic and real data changes by varying the speaker embedding or by scaling the model size. For the latter we show that the TTS models generalize well, even when training scores indicate overfitting. △ Less

Submitted 26 October, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

Comments: Published at the SynData4GenAI 2024 workshop

arXiv:2310.08132 [pdf, other]

On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition

Authors: Nick Rossenbach, Benedikt Hilmes, Ralf Schlüter

Abstract: Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain mismatch tasks. It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. By using a novel oracle setup we show h… ▽ More Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain mismatch tasks. It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. By using a novel oracle setup we show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive (NAR) TTS. To get reference phoneme durations we use two common alignment methods, a hidden Markov Gaussian-mixture model (HMM-GMM) aligner and a neural connectionist temporal classification (CTC) aligner. Using a simple algorithm based on random walks we shift phoneme duration distributions of the TTS system closer to real durations, resulting in an improvement of an ASR system using synthetic data in a semi-supervised setting. △ Less

Submitted 12 October, 2023; originally announced October 2023.

Comments: To appear at ASRU 2023

Showing 1–5 of 5 results for author: Hilmes, B