-
Attacker's Noise Can Manipulate Your Audio-based LLM in the Real World
Authors:
Vinu Sankar Sadasivan,
Soheil Feizi,
Rajiv Mathews,
Lun Wang
Abstract:
This paper investigates the real-world vulnerabilities of audio-based large language models (ALLMs), such as Qwen2-Audio. We first demonstrate that an adversary can craft stealthy audio perturbations to manipulate ALLMs into exhibiting specific targeted behaviors, such as eliciting responses to wake-keywords (e.g., "Hey Qwen"), or triggering harmful behaviors (e.g. "Change my calendar event"). Sub…
▽ More
This paper investigates the real-world vulnerabilities of audio-based large language models (ALLMs), such as Qwen2-Audio. We first demonstrate that an adversary can craft stealthy audio perturbations to manipulate ALLMs into exhibiting specific targeted behaviors, such as eliciting responses to wake-keywords (e.g., "Hey Qwen"), or triggering harmful behaviors (e.g. "Change my calendar event"). Subsequently, we show that playing adversarial background noise during user interaction with the ALLMs can significantly degrade the response quality. Crucially, our research illustrates the scalability of these attacks to real-world scenarios, impacting other innocent users when these adversarial noises are played through the air. Further, we discuss the transferrability of the attack, and potential defensive measures.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
Parameter-Efficient Transfer Learning under Federated Learning for Automatic Speech Recognition
Authors:
Xuan Kan,
Yonghui Xiao,
Tien-Ju Yang,
Nanxin Chen,
Rajiv Mathews
Abstract:
This work explores the challenge of enhancing Automatic Speech Recognition (ASR) model performance across various user-specific domains while preserving user data privacy. We employ federated learning and parameter-efficient domain adaptation methods to solve the (1) massive data requirement of ASR models from user-specific scenarios and (2) the substantial communication cost between servers and c…
▽ More
This work explores the challenge of enhancing Automatic Speech Recognition (ASR) model performance across various user-specific domains while preserving user data privacy. We employ federated learning and parameter-efficient domain adaptation methods to solve the (1) massive data requirement of ASR models from user-specific scenarios and (2) the substantial communication cost between servers and clients during federated learning. We demonstrate that when equipped with proper adapters, ASR models under federated tuning can achieve similar performance compared with centralized tuning ones, thus providing a potential direction for future privacy-preserved ASR services. Besides, we investigate the efficiency of different adapters and adapter incorporation strategies under the federated learning setting.
△ Less
Submitted 19 August, 2024;
originally announced August 2024.
-
Unintended Memorization in Large ASR Models, and How to Mitigate It
Authors:
Lun Wang,
Om Thakkar,
Rajiv Mathews
Abstract:
It is well-known that neural networks can unintentionally memorize their training examples, causing privacy concerns. However, auditing memorization in large non-auto-regressive automatic speech recognition (ASR) models has been challenging due to the high compute cost of existing methods such as hardness calibration. In this work, we design a simple auditing method to measure memorization in larg…
▽ More
It is well-known that neural networks can unintentionally memorize their training examples, causing privacy concerns. However, auditing memorization in large non-auto-regressive automatic speech recognition (ASR) models has been challenging due to the high compute cost of existing methods such as hardness calibration. In this work, we design a simple auditing method to measure memorization in large ASR models without the extra compute overhead. Concretely, we speed up randomly-generated utterances to create a mapping between vocal and text information that is difficult to learn from typical training examples. Hence, accurate predictions only for sped-up training examples can serve as clear evidence for memorization, and the corresponding accuracy can be used to measure memorization. Using the proposed method, we showcase memorization in the state-of-the-art ASR models. To mitigate memorization, we tried gradient clipping during training to bound the influence of any individual example on the final model. We empirically show that clipping each example's gradient can mitigate memorization for sped-up training examples with up to 16 repetitions in the training set. Furthermore, we show that in large-scale distributed training, clipping the average gradient on each compute core maintains neutral model quality and compute cost while providing strong privacy protection.
△ Less
Submitted 18 October, 2023;
originally announced October 2023.
-
The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning
Authors:
Lillian Zhou,
Yuxin Ding,
Mingqing Chen,
Harry Zhang,
Rohit Prabhavalkar,
Dhruv Guliani,
Giovanni Motta,
Rajiv Mathews
Abstract:
Automatic speech recognition (ASR) models are typically trained on large datasets of transcribed speech. As language evolves and new terms come into use, these models can become outdated and stale. In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continu…
▽ More
Automatic speech recognition (ASR) models are typically trained on large datasets of transcribed speech. As language evolves and new terms come into use, these models can become outdated and stale. In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continually learn from on-device user corrections through Federated Learning (FL) to address this issue. We explore techniques to target fresh terms that the model has not previously encountered, learn long-tail words, and mitigate catastrophic forgetting. In experimental evaluations, we find that the proposed techniques improve model recognition of fresh terms, while preserving quality on the overall language distribution.
△ Less
Submitted 30 November, 2023; v1 submitted 29 September, 2023;
originally announced October 2023.
-
Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning
Authors:
Sandy Ritchie,
You-Chi Cheng,
Mingqing Chen,
Rajiv Mathews,
Daan van Esch,
Bo Li,
Khe Chai Sim
Abstract:
Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data…
▽ More
Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that pooling the small amounts of data available in multilingual end-to-end models, and pre-training on unsupervised data can help improve speech recognition quality for many African languages.
△ Less
Submitted 4 October, 2022; v1 submitted 5 August, 2022;
originally announced August 2022.
-
UserLibri: A Dataset for ASR Personalization Using Only Text
Authors:
Theresa Breiner,
Swaroop Ramaswamy,
Ehsan Variani,
Shefali Garg,
Rajiv Mathews,
Khe Chai Sim,
Kilol Gupta,
Mingqing Chen,
Lara McConnaughey
Abstract:
Personalization of speech models on mobile devices (on-device personalization) is an active area of research, but more often than not, mobile devices have more text-only data than paired audio-text data. We explore training a personalized language model on text-only data, used during inference to improve speech recognition performance for that user. We experiment on a user-clustered LibriSpeech co…
▽ More
Personalization of speech models on mobile devices (on-device personalization) is an active area of research, but more often than not, mobile devices have more text-only data than paired audio-text data. We explore training a personalized language model on text-only data, used during inference to improve speech recognition performance for that user. We experiment on a user-clustered LibriSpeech corpus, supplemented with personalized text-only data for each user from Project Gutenberg. We release this User-Specific LibriSpeech (UserLibri) dataset to aid future personalization research. LibriSpeech audio-transcript pairs are grouped into 55 users from the test-clean dataset and 52 users from test-other. We are able to lower the average word error rate per user across both sets in streaming and nonstreaming models, including an improvement of 2.5 for the harder set of test-other users when streaming.
△ Less
Submitted 1 July, 2022;
originally announced July 2022.
-
Detecting Unintended Memorization in Language-Model-Fused ASR
Authors:
W. Ronny Huang,
Steve Chien,
Om Thakkar,
Rajiv Mathews
Abstract:
End-to-end (E2E) models are often being accompanied by language models (LMs) via shallow fusion for boosting their overall quality as well as recognition of rare words. At the same time, several prior works show that LMs are susceptible to unintentionally memorizing rare or unique sequences in the training data. In this work, we design a framework for detecting memorization of random textual seque…
▽ More
End-to-end (E2E) models are often being accompanied by language models (LMs) via shallow fusion for boosting their overall quality as well as recognition of rare words. At the same time, several prior works show that LMs are susceptible to unintentionally memorizing rare or unique sequences in the training data. In this work, we design a framework for detecting memorization of random textual sequences (which we call canaries) in the LM training data when one has only black-box (query) access to LM-fused speech recognizer, as opposed to direct access to the LM. On a production-grade Conformer RNN-T E2E model fused with a Transformer LM, we show that detecting memorization of singly-occurring canaries from the LM training data of 300M examples is possible. Motivated to protect privacy, we also show that such memorization gets significantly reduced by per-example gradient-clipped LM training without compromising overall quality.
△ Less
Submitted 28 June, 2022; v1 submitted 20 April, 2022;
originally announced April 2022.
-
Extracting Targeted Training Data from ASR Models, and How to Mitigate It
Authors:
Ehsan Amid,
Om Thakkar,
Arun Narayanan,
Rajiv Mathews,
Françoise Beaufays
Abstract:
Recent work has designed methods to demonstrate that model updates in ASR training can leak potentially sensitive attributes of the utterances used in computing the updates. In this work, we design the first method to demonstrate information leakage about training data from trained ASR models. We design Noise Masking, a fill-in-the-blank style method for extracting targeted parts of training data…
▽ More
Recent work has designed methods to demonstrate that model updates in ASR training can leak potentially sensitive attributes of the utterances used in computing the updates. In this work, we design the first method to demonstrate information leakage about training data from trained ASR models. We design Noise Masking, a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models. We demonstrate the success of Noise Masking by using it in four settings for extracting names from the LibriSpeech dataset used for training a state-of-the-art Conformer model. In particular, we show that we are able to extract the correct names from masked training utterances with 11.8% accuracy, while the model outputs some name from the train set 55.2% of the time. Further, we show that even in a setting that uses synthetic audio and partial transcripts from the test set, our method achieves 2.5% correct name accuracy (47.7% any name success rate). Lastly, we design Word Dropout, a data augmentation method that we show when used in training along with Multistyle TRaining (MTR), provides comparable utility as the baseline, along with significantly mitigating extraction via Noise Masking across the four evaluated settings.
△ Less
Submitted 27 June, 2022; v1 submitted 18 April, 2022;
originally announced April 2022.
-
Production federated keyword spotting via distillation, filtering, and joint federated-centralized training
Authors:
Andrew Hard,
Kurt Partridge,
Neng Chen,
Sean Augenstein,
Aishanee Shah,
Hyun Jin Park,
Alex Park,
Sara Ng,
Jessica Nguyen,
Ignacio Lopez Moreno,
Rajiv Mathews,
Françoise Beaufays
Abstract:
We trained a keyword spotting model using federated learning on real user devices and observed significant improvements when the model was deployed for inference on phones. To compensate for data domains that are missing from on-device training caches, we employed joint federated-centralized training. And to learn in the absence of curated labels on-device, we formulated a confidence filtering str…
▽ More
We trained a keyword spotting model using federated learning on real user devices and observed significant improvements when the model was deployed for inference on phones. To compensate for data domains that are missing from on-device training caches, we employed joint federated-centralized training. And to learn in the absence of curated labels on-device, we formulated a confidence filtering strategy based on user-feedback signals for federated distillation. These techniques created models that significantly improved quality metrics in offline evaluations and user-experience metrics in live A/B experiments.
△ Less
Submitted 29 June, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Unsupervised multi-latent space reinforcement learning framework for video summarization in ultrasound imaging
Authors:
Roshan P Mathews,
Mahesh Raveendranatha Panicker,
Abhilash R Hareendranathan,
Yale Tung Chen,
Jacob L Jaremko,
Brian Buchanan,
Kiran Vishnu Narayan,
Kesavadas C,
Greeta Mathews
Abstract:
The COVID-19 pandemic has highlighted the need for a tool to speed up triage in ultrasound scans and provide clinicians with fast access to relevant information. The proposed video-summarization technique is a step in this direction that provides clinicians access to relevant key-frames from a given ultrasound scan (such as lung ultrasound) while reducing resource, storage and bandwidth requiremen…
▽ More
The COVID-19 pandemic has highlighted the need for a tool to speed up triage in ultrasound scans and provide clinicians with fast access to relevant information. The proposed video-summarization technique is a step in this direction that provides clinicians access to relevant key-frames from a given ultrasound scan (such as lung ultrasound) while reducing resource, storage and bandwidth requirements. We propose a new unsupervised reinforcement learning (RL) framework with novel rewards that facilitates unsupervised learning avoiding tedious and impractical manual labelling for summarizing ultrasound videos to enhance its utility as a triage tool in the emergency department (ED) and for use in telemedicine. Using an attention ensemble of encoders, the high dimensional image is projected into a low dimensional latent space in terms of: a) reduced distance with a normal or abnormal class (classifier encoder), b) following a topology of landmarks (segmentation encoder), and c) the distance or topology agnostic latent representation (convolutional autoencoders). The decoder is implemented using a bi-directional long-short term memory (Bi-LSTM) which utilizes the latent space representation from the encoder. Our new paradigm for video summarization is capable of delivering classification labels and segmentation of key landmarks for each of the summarized keyframes. Validation is performed on lung ultrasound (LUS) dataset, that typically represent potential use cases in telemedicine and ED triage acquired from different medical centers across geographies (India, Spain and Canada).
△ Less
Submitted 3 September, 2021;
originally announced September 2021.
-
Towards Fast Region Adaptive Ultrasound Beamformer for Plane Wave Imaging Using Convolutional Neural Networks
Authors:
Roshan P Mathews,
Mahesh Raveendranatha Panicker
Abstract:
Automatic learning algorithms for improving the image quality of diagnostic B-mode ultrasound (US) images have been gaining popularity in the recent past. In this work, a novel convolutional neural network (CNN) is trained using time of flight corrected in-vivo receiver data of plane wave transmit to produce corresponding high-quality minimum variance distortion less response (MVDR) beamformed ima…
▽ More
Automatic learning algorithms for improving the image quality of diagnostic B-mode ultrasound (US) images have been gaining popularity in the recent past. In this work, a novel convolutional neural network (CNN) is trained using time of flight corrected in-vivo receiver data of plane wave transmit to produce corresponding high-quality minimum variance distortion less response (MVDR) beamformed image. A comprehensive performance comparison in terms of qualitative and quantitative measures for fully connected neural network (FCNN), the proposed CNN architecture, MVDR and Delay and Sum (DAS) using the dataset from Plane wave Imaging Challenge in Ultrasound (PICMUS) is also reported in this work. The CNN architecture could leverage the spatial information and will be more region adaptive during the beamforming process. This is evident from the improvement seen over the baseline FCNN approach and conventional MVDR beamformer, both in resolution and contrast with an improvement of 6 dB in CNR using only zero-angle transmission over the baseline. With the observed reduction in the requirement of number of angles to produce similar image metrics would prove advantageous in providing a possibility for higher frame rates.
△ Less
Submitted 17 August, 2021; v1 submitted 13 June, 2021;
originally announced June 2021.
-
CAD Applications and Emerging Research Potential in Medical Imaging
Authors:
Roshan P. Mathews,
Greeta Mathews
Abstract:
Computer Aided Detection (CAD) is a valuable technique for precisely interpreting medical images and it has a global business opportunity of about USD 1.8 billion. The current aspects with reference to the four sub stages such as image pre-processing, segmentation, feature extraction and classification and the future scope of CAD in medical imaging has been discussed in this paper. Many reviewers…
▽ More
Computer Aided Detection (CAD) is a valuable technique for precisely interpreting medical images and it has a global business opportunity of about USD 1.8 billion. The current aspects with reference to the four sub stages such as image pre-processing, segmentation, feature extraction and classification and the future scope of CAD in medical imaging has been discussed in this paper. Many reviewers have emphasized the need for synergy between engineers and medical professionals for successful development of CAD systems and the current work is a move in that direction. The engineering aspects of the above four stages in four imaging modalities viz. computed tomography, magnetic resonance imaging, mammography and bone scintigraphy used in the diagnosis of five critical diseases have been discussed with a clinical background. Automatic classification of image can play an important role in preliminary screening of very critical ailments bringing down the cost of health care. Another recent advancement is using artificial intelligence and machine learning techniques. This paper reviews these engineering aspects with a view to explore the opportunities to researchers as well as the medical industry to offer affordable medical services with accessibility in even remote locations.
△ Less
Submitted 30 September, 2020;
originally announced September 2020.
-
Training Keyword Spotting Models on Non-IID Data with Federated Learning
Authors:
Andrew Hard,
Kurt Partridge,
Cameron Nguyen,
Niranjan Subrahmanya,
Aishanee Shah,
Pai Zhu,
Ignacio Lopez Moreno,
Rajiv Mathews
Abstract:
We demonstrate that a production-quality keyword-spotting model can be trained on-device using federated learning and achieve comparable false accept and false reject rates to a centrally-trained model. To overcome the algorithmic constraints associated with fitting on-device data (which are inherently non-independent and identically distributed), we conduct thorough empirical studies of optimizat…
▽ More
We demonstrate that a production-quality keyword-spotting model can be trained on-device using federated learning and achieve comparable false accept and false reject rates to a centrally-trained model. To overcome the algorithmic constraints associated with fitting on-device data (which are inherently non-independent and identically distributed), we conduct thorough empirical studies of optimization algorithms and hyperparameter configurations using large-scale federated simulations. To overcome resource constraints, we replace memory intensive MTR data augmentation with SpecAugment, which reduces the false reject rate by 56%. Finally, to label examples (given the zero visibility into on-device data), we explore teacher-student training.
△ Less
Submitted 4 June, 2020; v1 submitted 20 May, 2020;
originally announced May 2020.