Search | arXiv e-print repository

Theoretical guarantees for neural estimators in parametric statistics

Authors: Almut Rödder, Manuel Hentschel, Sebastian Engelke

Abstract: Neural estimators are simulation-based estimators for the parameters of a family of statistical models, which build a direct mapping from the sample to the parameter vector. They benefit from the versatility of available network architectures and efficient training methods developed in the field of deep learning. Neural estimators are amortized in the sense that, once trained, they can be applied… ▽ More Neural estimators are simulation-based estimators for the parameters of a family of statistical models, which build a direct mapping from the sample to the parameter vector. They benefit from the versatility of available network architectures and efficient training methods developed in the field of deep learning. Neural estimators are amortized in the sense that, once trained, they can be applied to any new data set with almost no computational cost. While many papers have shown very good performance of these methods in simulation studies and real-world applications, so far no statistical guarantees are available to support these observations theoretically. In this work, we study the risk of neural estimators by decomposing it into several terms that can be analyzed separately. We formulate easy-to-check assumptions ensuring that each term converges to zero, and we verify them for popular applications of neural estimators. Our results provide a general recipe to derive theoretical guarantees also for broader classes of architectures and estimation problems. △ Less

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.01263 [pdf, ps, other]

WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing

Authors: Yu Nakagome, Michael Hentschel

Abstract: Despite recent advances in end-to-end speech recognition methods, the output tends to be biased to the training data's vocabulary, resulting in inaccurate recognition of proper nouns and other unknown terms. To address this issue, we propose a method to improve recognition accuracy of such rare words in CTC-based models without additional training or text-to-speech systems. Specifically, keyword s… ▽ More Despite recent advances in end-to-end speech recognition methods, the output tends to be biased to the training data's vocabulary, resulting in inaccurate recognition of proper nouns and other unknown terms. To address this issue, we propose a method to improve recognition accuracy of such rare words in CTC-based models without additional training or text-to-speech systems. Specifically, keyword spotting is performed using acoustic features of intermediate layers during inference, and a bias is applied to the subsequent layers of the acoustic model for detected keywords. For keyword detection, we adopt a wildcard CTC that is both fast and tolerant of ambiguous matches, allowing flexible handling of words that are difficult to match strictly. Since this method does not require retraining of existing models, it can be easily applied to even large-scale models. In experiments on Japanese speech recognition, the proposed method achieved a 29% improvement in the F1 score for unknown words. △ Less

Submitted 1 June, 2025; originally announced June 2025.

Comments: Accepted to Interspeech 2025

arXiv:2406.14890 [pdf, other]

InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions

Authors: Yu Nakagome, Michael Hentschel

Abstract: Despite recent advances in end-to-end speech recognition methods, their output is biased to the training data's vocabulary, resulting in inaccurate recognition of unknown terms or proper nouns. To improve the recognition accuracy for a given set of such terms, we propose an adaptation parameter-free approach based on Self-conditioned CTC. Our method improves the recognition accuracy of misrecogniz… ▽ More Despite recent advances in end-to-end speech recognition methods, their output is biased to the training data's vocabulary, resulting in inaccurate recognition of unknown terms or proper nouns. To improve the recognition accuracy for a given set of such terms, we propose an adaptation parameter-free approach based on Self-conditioned CTC. Our method improves the recognition accuracy of misrecognized target keywords by substituting their intermediate CTC predictions with corrected labels, which are then passed on to the subsequent layers. First, we create pairs of correct labels and recognition error instances for a keyword list using Text-to-Speech and a recognition model. We use these pairs to replace intermediate prediction errors by the labels. Conditioning the subsequent layers of the encoder on the labels, it is possible to acoustically evaluate the target keywords. Experiments conducted in Japanese demonstrated that our method successfully improved the F1 score for unknown words. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2401.11700 [pdf, other]

Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

Authors: Michael Hentschel, Yuta Nishikawa, Tatsuya Komatsu, Yusuke Fujita

Abstract: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. To distil the teacher's knowledge, we use an attention decoder that learns from BERT's token probabilities. Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the in… ▽ More This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. To distil the teacher's knowledge, we use an attention decoder that learns from BERT's token probabilities. Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer. By using the intermediate layers as distillation target, we can more effectively distil LM knowledge into the lower network layers. Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding. Experiments on the LibriSpeech dataset demonstrate the effectiveness of our approach in enhancing greedy decoding with connectionist temporal classification (CTC). △ Less

Submitted 22 January, 2024; originally announced January 2024.

Comments: Accepted at ICASSP 2024

arXiv:2310.06372 [pdf, other]

Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data

Authors: Lukas Struppek, Martin B. Hentschel, Clifton Poth, Dominik Hintersdorf, Kristian Kersting

Abstract: Backdoor attacks pose a serious security threat for training neural networks as they surreptitiously introduce hidden functionalities into a model. Such backdoors remain silent during inference on clean inputs, evading detection due to inconspicuous behavior. However, once a specific trigger pattern appears in the input data, the backdoor activates, causing the model to execute its concealed funct… ▽ More Backdoor attacks pose a serious security threat for training neural networks as they surreptitiously introduce hidden functionalities into a model. Such backdoors remain silent during inference on clean inputs, evading detection due to inconspicuous behavior. However, once a specific trigger pattern appears in the input data, the backdoor activates, causing the model to execute its concealed function. Detecting such poisoned samples within vast datasets is virtually impossible through manual inspection. To address this challenge, we propose a novel approach that enables model training on potentially poisoned datasets by utilizing the power of recent diffusion models. Specifically, we create synthetic variations of all training samples, leveraging the inherent resilience of diffusion models to potential trigger patterns in the data. By combining this generative approach with knowledge distillation, we produce student models that maintain their general performance on the task while exhibiting robust resistance to backdoor triggers. △ Less

Submitted 13 December, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: Published at NeurIPS 2023 Workshop on Backdoors in Deep Learning: The Good, the Bad, and the Ugly

arXiv:2202.01405 [pdf, other]

Joint Speech Recognition and Audio Captioning

Authors: Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

Abstract: Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AA… ▽ More Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR). The goal of AAC is to generate natural language descriptions of contents in audio samples. We propose several approaches for end-to-end joint modeling of ASR and AAC tasks and demonstrate their advantages over traditional approaches, which model these tasks independently. A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions. Therefore we also create a multi-task dataset by mixing the clean speech Wall Street Journal corpus with multiple levels of background noises chosen from the AudioCaps dataset. We also perform extensive experimental evaluation and show improvements of our proposed methods as compared to existing state-of-the-art ASR and AAC methods. △ Less

Submitted 2 February, 2022; originally announced February 2022.

Comments: 5 pages, 2 figures. Accepted for ICASSP 2022

arXiv:2201.10190 [pdf, ps, other]

Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

Authors: Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe

Abstract: A streaming style inference of encoder-decoder automatic speech recognition (ASR) system is important for reducing latency, which is essential for interactive use cases. To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination. In the endpoint prediction, we compute the expectation of the numbe… ▽ More A streaming style inference of encoder-decoder automatic speech recognition (ASR) system is important for reducing latency, which is essential for interactive use cases. To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination. In the endpoint prediction, we compute the expectation of the number of tokens that are yet to be emitted in the encoder features of the current blocks using the CTC posterior. Based on the expectation value, the decoder predicts the endpoint to realize continuous block synchronization, as a running stitch. Meanwhile, endpoint post-determination probabilistically detects backward jump of the source-target attention, which is caused by the misprediction of endpoints. Then it resumes decoding by discarding those hypotheses, as back stitch. We combine these methods into a hybrid approach, namely run-and-back stitch search, which reduces the computational cost and latency. Evaluations of various ASR tasks show the efficiency of our proposed decoding algorithm, which achieves a latency reduction, for instance in the Librispeech test set from 1487 ms to 821 ms at the 90th percentile, while maintaining a high recognition accuracy. △ Less

Submitted 25 January, 2022; originally announced January 2022.

Comments: Accepted for ICASSP2022

arXiv:1407.6714 [pdf, other]

CrowdSTAR: A Social Task Routing Framework for Online Communities

Authors: Besmira Nushi, Omar Alonso, Martin Hentschel, Vasileios Kandylas

Abstract: The online communities available on the Web have shown to be significantly interactive and capable of collectively solving difficult tasks. Nevertheless, it is still a challenge to decide how a task should be dispatched through the network due to the high diversity of the communities and the dynamically changing expertise and social availability of their members. We introduce CrowdSTAR, a framewor… ▽ More The online communities available on the Web have shown to be significantly interactive and capable of collectively solving difficult tasks. Nevertheless, it is still a challenge to decide how a task should be dispatched through the network due to the high diversity of the communities and the dynamically changing expertise and social availability of their members. We introduce CrowdSTAR, a framework designed to route tasks across and within online crowds. CrowdSTAR indexes the topic-specific expertise and social features of the crowd contributors and then uses a routing algorithm, which suggests the best sources to ask based on the knowledge vs. availability trade-offs. We experimented with the proposed framework for question and answering scenarios by using two popular social networks as crowd candidates: Twitter and Quora. △ Less

Submitted 24 July, 2014; originally announced July 2014.

ACM Class: H.4.m; H.5.3

Showing 1–8 of 8 results for author: Hentschel, M