Showing 1–19 of 19 results for author: Strimel, G

Searching in archive eess.
  1. arXiv:2505.13085

    eess.AS cs.LG

    Universal Semantic Disentangled Privacy-preserving Speech Representation Learning

    Authors: Biel Tura Vecino, Subhadeep Maji, Aravind Varier, Antonio Bonafonte, Ivan Valles, Michael Owen, Leif Rädel, Grant Strimel, Seyi Feyisetan, Roberto Barra Chicote, Ariya Rastrow, Constantinos Papayiannis, Volker Leutnant, Trevor Wood

    Abstract: The use of audio recordings of human speech to train LLMs poses privacy concerns due to these models' potential to generate outputs that closely resemble artifacts in the training data. In this study, we propose a speaker privacy-preserving representation learning method through the Universal Speech Codec (USC), a computationally efficient encoder-decoder model that disentangles speech into: (i) p…

    Submitted 20 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Extended report of the article accepted at Interspeech 2025 (v1)

  2. arXiv:2504.09081

    eess.AS cs.AI cs.CL

    SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

    Authors: Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, Arunasish Sen, Jian Xie, Grant P. Strimel, Andreas Schwarz

    Abstract: We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range…

    Submitted 17 April, 2025; v1 submitted 12 April, 2025; originally announced April 2025.

  3. arXiv:2406.09618

    cs.CL cs.AI cs.IR cs.SD eess.AS

    Multi-Modal Retrieval For Large Language Model Based Speech Recognition

    Authors: Jari Kolehmainen, Aditya Gourav, Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ariya Rastrow, Grant Strimel, Ivan Bulyko

    Abstract: Retrieval is a widely adopted approach for improving language models leveraging external information. As the field moves towards multi-modal large language models, it is important to extend the pure text based methods to incorporate other modalities in retrieval as well for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieva…

    Submitted 13 June, 2024; originally announced June 2024.

  4. arXiv:2305.05271

    cs.CL cs.LG cs.SD eess.AS

    Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition

    Authors: Xuandi Fu, Kanthashree Mysore Sathyendra, Ankur Gandhe, Jing Liu, Grant P. Strimel, Ross McGowan, Athanasios Mouchtaris

    Abstract: Attention-based contextual biasing approaches have shown significant improvements in the recognition of generic and/or personal rare-words in End-to-End Automatic Speech Recognition (E2E ASR) systems like neural transducers. These approaches employ cross-attention to bias the model towards specific contextual entities injected as bias-phrases to the model. Prior approaches typically relied on subw…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2023

  5. arXiv:2305.04159

    eess.AS

    Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers

    Authors: Grant P. Strimel, Yi Xie, Brian King, Martin Radfar, Ariya Rastrow, Athanasios Mouchtaris

    Abstract: Streaming speech recognition architectures are employed for low-latency, real-time applications. Such architectures are often characterized by their causality. Causal architectures emit tokens at each frame, relying only on current and past signal, while non-causal models are exposed to a window of future frames at each step to increase predictive accuracy. This dichotomy amounts to a trade-off fo…

    Submitted 9 May, 2023; v1 submitted 6 May, 2023; originally announced May 2023.

    Comments: Accepted to ICML 2023

  6. arXiv:2304.01905

    cs.SD cs.CL cs.LG eess.AS

    Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition

    Authors: Saumya Y. Sahai, Jing Liu, Thejaswi Muniyappa, Kanthashree M. Sathyendra, Anastasios Alexandridis, Grant P. Strimel, Ross McGowan, Ariya Rastrow, Feng-Ju Chang, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: We present dual-attention neural biasing, an architecture designed to boost Wake Words (WW) recognition and improve inference time latency on speech recognition tasks. This architecture enables a dynamic switch for its runtime compute paths by exploiting WW spotting to select which branch of its attention networks to execute for an input audio frame. With this approach, we effectively improve WW s…

    Submitted 4 April, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

    Comments: Accepted to Proc. IEEE ICASSP 2023

  7. arXiv:2303.17799

    cs.CL cs.SD eess.AS

    Dialog act guided contextual adapter for personalized speech recognition

    Authors: Feng-Ju Chang, Thejaswi Muniyappa, Kanthashree Mysore Sathyendra, Kai Wei, Grant P. Strimel, Ross McGowan

    Abstract: Personalization in multi-turn dialogs has been a long standing challenge for end-to-end automatic speech recognition (E2E ASR) models. Recent work on contextual adapters has tackled rare word recognition using user catalogs. This adaptation, however, does not incorporate an important cue, the dialog act, which is available in a multi-turn dialog scenario. In this work, we propose a dialog act guid…

    Submitted 31 March, 2023; originally announced March 2023.

    Comments: Accepted at ICASSP 2023

  8. PROCTER: PROnunciation-aware ConTextual adaptER for personalized speech recognition in neural transducers

    Authors: Rahul Pandey, Roger Ren, Qi Luo, Jing Liu, Ariya Rastrow, Ankur Gandhe, Denis Filimonov, Grant Strimel, Andreas Stolcke, Ivan Bulyko

    Abstract: End-to-End (E2E) automatic speech recognition (ASR) systems used in voice assistants often have difficulties recognizing infrequent words personalized to the user, such as names and places. Rare words often have non-trivial pronunciations, and in such cases, human knowledge in the form of a pronunciation lexicon can be useful. We propose a PROnunCiation-aware conTextual adaptER (PROCTER) that dyna…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: To appear in Proc. IEEE ICASSP

    Journal ref: Proc. IEEE ICASSP, June 2023

  9. arXiv:2210.09188

    cs.SD cs.LG eess.AS

    Sub-8-bit quantization for on-device speech recognition: a regularization-free approach

    Authors: Kai Zhen, Martin Radfar, Hieu Duy Nguyen, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris

    Abstract: For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous to achieve the trade-off between model predictive performance and efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with…

    Submitted 1 November, 2022; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: Accepted for publication at IEEE SLT'22

  10. arXiv:2209.14868

    cs.SD cs.CL eess.AS

    ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition

    Authors: Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris

    Abstract: The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer architecture was introduced where the encoder of RNN-T is replaced with a modified Transformer encoder composed of convolutional layers at the frontend and betwee…

    Submitted 29 September, 2022; originally announced September 2022.

    Comments: This paper was presented in Interspeech 2022

  11. arXiv:2207.02393

    cs.CL cs.SD eess.AS

    Compute Cost Amortized Transformer for Streaming ASR

    Authors: Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian King, Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel

    Abstract: We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization. Our architecture creates sparse computation pathways dynamically at inference time, resulting in selective use of compute resources throughout decoding, enabling significant reductions in compute with minimal impact on acc…

    Submitted 4 July, 2022; originally announced July 2022.

  12. Latency Control for Keyword Spotting

    Authors: Christin Jose, Joseph Wang, Grant P. Strimel, Mohammad Omar Khursheed, Yuriy Mishchenko, Brian Kulis

    Abstract: Conversational agents commonly utilize keyword spotting (KWS) to initiate voice interaction with the user. For user experience and privacy considerations, existing approaches to KWS largely focus on accuracy, which can often come at the expense of introduced latency. To address this tradeoff, we propose a novel approach to control KWS model latency and which generalizes to any loss function withou…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: Proceedings of INTERSPEECH

  13. arXiv:2205.05590

    cs.CL cs.SD eess.AS

    A neural prosody encoder for end-to-end dialogue act classification

    Authors: Kai Wei, Dillon Knox, Martin Radfar, Thanh Tran, Markus Muller, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Maurizio Omologo

    Abstract: Dialogue act classification (DAC) is a critical task for spoken language understanding in dialogue systems. Prosodic features such as energy and pitch have been shown to be useful for DAC. Despite their importance, little research has explored neural approaches to integrate prosodic features into end-to-end (E2E) DAC models which infer dialogue acts directly from audio signals. In this work, we pr…

    Submitted 11 May, 2022; originally announced May 2022.

  14. arXiv:2204.00558

    cs.CL cs.SD eess.AS

    Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding

    Authors: Xuandi Fu, Feng-Ju Chang, Martin Radfar, Kai Wei, Jing Liu, Grant P. Strimel, Kanthashree Mysore Sathyendra

    Abstract: End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency when compared to traditionally cascaded pipelines. Existing E2E SLU models usually follow a two-stage configuration where an Automatic Speech Recognition (ASR) network first predicts a transcript which is then passed to a Natural Language Understanding (N…

    Submitted 1 April, 2022; originally announced April 2022.

    Comments: Accepted at ICASSP 2022

  15. arXiv:2108.01704

    eess.AS cs.SD

    Bifocal Neural ASR: Exploiting Keyword Spotting for Inference Optimization

    Authors: Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow

    Abstract: We present Bifocal RNN-T, a new variant of the Recurrent Neural Network Transducer (RNN-T) architecture designed for improved inference time latency on speech recognition tasks. The architecture enables a dynamic pivot for its runtime compute pathway, namely taking advantage of keyword spotting to select which component of the network to execute for a given audio frame. To accomplish this, we leve…

    Submitted 3 August, 2021; originally announced August 2021.

    Comments: Accepted at ICASSP 2021

  16. arXiv:2108.01561

    eess.AS cs.SD

    Learning a Neural Diff for Speech Models

    Authors: Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow

    Abstract: As more speech processing applications execute locally on edge devices, a set of resource constraints must be considered. In this work we address one of these constraints, namely over-the-network data budgets for transferring models from server to device. We present neural update approaches for release of subsequent speech model generations abiding by a data budget. We detail two architecture-agno…

    Submitted 17 August, 2021; v1 submitted 3 August, 2021; originally announced August 2021.

    Comments: Accepted at Interspeech 2021

  17. arXiv:2108.01553

    eess.AS cs.SD

    Amortized Neural Networks for Low-Latency Speech Recognition

    Authors: Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow

    Abstract: We introduce Amortized Neural Networks (AmNets), a compute cost- and latency-aware network architecture particularly well-suited for sequence modeling tasks. We apply AmNets to the Recurrent Neural Network Transducer (RNN-T) to reduce compute cost and latency for an automatic speech recognition (ASR) task. The AmNets RNN-T architecture enables the network to dynamically switch between encoder bran…

    Submitted 3 August, 2021; originally announced August 2021.

    Comments: Accepted at Interspeech 2021

  18. arXiv:2106.07734

    cs.CL cs.LG eess.AS

    CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition

    Authors: Rupak Vignesh Swaminathan, Brian King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris

    Abstract: We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer's encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions. This rich information is suppressed when combined with the lower entropy decoder outputs to produce the joint…

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: Accepted at InterSpeech 2021

  19. arXiv:2008.02858

    cs.CL cs.SD eess.AS

    Semantic Complexity in End-to-End Spoken Language Understanding

    Authors: Joseph P. McKenna, Samridhi Choudhary, Michael Saxon, Grant P. Strimel, Athanasios Mouchtaris

    Abstract: End-to-end spoken language understanding (SLU) models are a class of model architectures that predict semantics directly from speech. Because of their input and output types, we refer to them as speech-to-interpretation (STI) models. Previous works have successfully applied STI models to targeted use cases, such as recognizing home automation commands, however no study has yet addressed how these…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: Accepted at Interspeech, 2020