-
Laboratory evaluation of a wearable instrumented headband for rotational head kinematics measurement
Authors:
Anu Tripathi,
Yang Wan,
Sushant Malave,
Sheila Turcsanyi,
Alice Lux Fawzi,
Alison Brooks,
Haneesh Kesari,
Traci Snedden,
Peter Ferrazzano,
Christian Franck,
Rika Carlsen
Abstract:
Mild traumatic brain injuries (mTBI) are a highly prevalent condition with heterogeneous outcomes between individuals. A key factor governing brain tissue deformation and the risk of mTBI is the rotational kinematics of the head. Instrumented mouthguards are a widely accepted method for measuring rotational head motions, owing to their robust sensor-skull coupling. However, wearing mouthguards is not feasible in all situations, especially for long-term data collection. Therefore, alternative wearable devices are needed. In this study, we present an improved design and data processing scheme for an instrumented headband. Our instrumented headband utilizes an array of inertial measurement units (IMUs) and a new data-processing scheme based on continuous wavelet transforms to address sources of error in the IMU measurements. The headband performance was evaluated in the laboratory on an anthropomorphic test device, which was impacted with a soccer ball to replicate soccer heading. When comparing the measured peak rotational velocities (PRV) and peak rotational accelerations (PRA) between the reference sensors and the headband for impacts to the front of the head, the correlation coefficients (r) were 0.80 and 0.63, and the normalized root mean square error (NRMSE) values were 0.20 and 0.28, respectively. However, when considering all impact locations, r dropped to 0.42 and 0.34 and NRMSE increased to 0.5 and 0.41 for PRV and PRA, respectively. This new instrumented headband improves upon previous headband designs in reconstructing the rotational head kinematics resulting from frontal soccer ball impacts, providing a potential alternative to instrumented mouthguards.
Submitted 2 April, 2025;
originally announced April 2025.
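The agreement metrics quoted in the headband abstract above, the correlation coefficient r and the normalized root mean square error (NRMSE), can be illustrated with a minimal Python sketch. The abstract does not say how the NRMSE is normalized; dividing by the range of the reference values is an assumption made here, and the function names and sample values are hypothetical rather than taken from the paper.

```python
import math


def pearson_r(reference, measured):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_meas = sum(measured) / n
    cov = sum((r - mean_ref) * (m - mean_meas) for r, m in zip(reference, measured))
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    var_meas = sum((m - mean_meas) ** 2 for m in measured)
    return cov / math.sqrt(var_ref * var_meas)


def nrmse(reference, measured):
    """RMSE divided by the range of the reference values.

    ASSUMPTION: range normalization; the abstract does not state which
    normalization the authors used.
    """
    n = len(reference)
    rmse = math.sqrt(sum((m - r) ** 2 for r, m in zip(reference, measured)) / n)
    return rmse / (max(reference) - min(reference))


# Hypothetical peak rotational velocities (rad/s), one value per impact.
reference_prv = [10.0, 20.0, 30.0, 40.0]
headband_prv = [12.0, 18.0, 33.0, 41.0]
print(pearson_r(reference_prv, headband_prv), nrmse(reference_prv, headband_prv))
```

A value of r close to 1 together with an NRMSE close to 0 would indicate that the wearable sensor tracks the reference sensor well across impacts.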
-
Deep Learning-based Classification of Dementia using Image Representation of Subcortical Signals
Authors:
Shivani Ranjan,
Ayush Tripathi,
Harshal Shende,
Robin Badal,
Amit Kumar,
Pramod Yadav,
Deepak Joshi,
Lalan Kumar
Abstract:
Dementia is a neurological syndrome marked by cognitive decline. Alzheimer's disease (AD) and frontotemporal dementia (FTD) are common forms of dementia, each with distinct progression patterns. EEG, a non-invasive tool for recording brain activity, has shown potential in distinguishing AD from FTD and mild cognitive impairment (MCI). Previous studies have utilized various EEG features, such as subband power and connectivity patterns, to differentiate these conditions. However, artifacts in EEG signals can obscure crucial information, necessitating advanced signal processing techniques. This study aims to develop a deep learning-based classification system for dementia by analyzing scout time-series signals from deep brain regions, specifically the hippocampus, amygdala, and thalamus. The study utilizes scout time series extracted via the standardized low-resolution brain electromagnetic tomography (sLORETA) technique. The time series are converted to image representations using the continuous wavelet transform (CWT) and fed as input to deep learning models. Two high-density EEG datasets are used to assess the efficacy of the proposed method: the online BrainLat dataset (comprising AD, FTD, and healthy controls (HC)) and the in-house IITD-AIIA dataset (including subjects with AD, MCI, and HC). Different classification strategies and classifier combinations have been utilized for the accurate mapping of classes on both datasets. The best results were achieved by using a product of probabilities from classifiers for left and right subcortical regions in conjunction with the DenseNet model architecture. It yields accuracies of 94.17% and 77.72% on the BrainLat and IITD-AIIA datasets, respectively. This highlights the potential of this approach for early and accurate differentiation of neurodegenerative disorders.
Submitted 20 August, 2024;
originally announced August 2024.
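The "product of probabilities" fusion mentioned in the dementia-classification abstract above can be sketched in a few lines of Python. This is only an illustration of the general idea: the function names, the class ordering, and the renormalization step are assumptions, not details taken from the paper.

```python
def fuse_by_product(left_probs, right_probs):
    """Multiply per-class probabilities from two classifiers and renormalize.

    ASSUMPTION: both lists use the same class order; renormalizing to sum to 1
    is a convenience added here and does not change the predicted class.
    """
    if len(left_probs) != len(right_probs):
        raise ValueError("both classifiers must score the same classes")
    products = [l * r for l, r in zip(left_probs, right_probs)]
    total = sum(products)
    if total == 0:
        raise ValueError("the two classifiers assign zero joint probability")
    return [p / total for p in products]


def predicted_class(probs):
    """Index of the class with the highest fused probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical outputs for three classes from the left- and right-hemisphere models.
left = [0.6, 0.3, 0.1]
right = [0.2, 0.7, 0.1]
fused = fuse_by_product(left, right)
print(fused, predicted_class(fused))
```

Multiplying the two distributions favors classes on which both classifiers agree, since a low score from either model pulls the fused score down.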
-
Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition
Authors:
Jaeyoung Kim,
Han Lu,
Soheil Khorram,
Anshuman Tripathi,
Qian Zhang,
Hasim Sak
Abstract:
Modern automatic speech recognition (ASR) systems are typically trained on tens of thousands of hours of speech data or more, which is one of the main factors in their great success. However, the distribution of such data is typically biased towards common accents or typical speech patterns. As a result, those systems often perform poorly on atypical accented speech. In this paper, we present accent clustering and mining schemes for fair speech recognition systems that can perform equally well on under-represented accented speech. For accent recognition, we applied three schemes to overcome the limited size of supervised accent data: supervised or unsupervised pre-training, distributionally robust optimization (DRO), and unsupervised clustering. All three schemes can significantly improve the accent recognition model, especially for unbalanced and small accented speech datasets. Fine-tuning ASR on the mined Indian-accented speech using the proposed supervised or unsupervised clustering schemes showed 10.0% and 5.3% relative improvements, respectively, compared to fine-tuning on randomly sampled speech.
Submitted 5 August, 2024;
originally announced August 2024.
-
Insights into Age-Related Functional Brain Changes during Audiovisual Integration Tasks: A Comprehensive EEG Source-Based Analysis
Authors:
Prerna Singh,
Ayush Tripathi,
Lalan Kumar,
Tapan Kumar Gandhi
Abstract:
The seamless integration of visual and auditory information is a fundamental aspect of human cognition. Although age-related functional changes in Audio-Visual Integration (AVI) have been extensively explored in the past, thorough studies across various age groups remain insufficient. Previous studies have provided valuable insights into age-related AVI using EEG-based sensor data. However, these studies have been limited in their ability to capture spatial information related to brain source activation and connectivity. To address these gaps, our study conducted a comprehensive audiovisual integration task with a specific focus on assessing the aging effects in various age groups, particularly middle-aged individuals. We presented visual, auditory, and audio-visual stimuli and recorded EEG data from healthy participants in Young (18-25 years), Transition (26-33 years), and Middle (34-42 years) age cohorts. We aimed to understand how aging affects brain activation and functional connectivity among hubs during audio-visual tasks. Our findings revealed delayed brain activation in middle-aged individuals, especially for bimodal stimuli. The superior temporal cortex and superior frontal gyrus showed significant changes in neuronal activation with aging. Lower frequency bands (theta and alpha) showed substantial changes with increasing age during AVI. Our findings also revealed that the AVI-associated brain regions can be clustered into five different brain networks using the k-means algorithm. Additionally, we observed increased functional connectivity in middle age, particularly in the frontal, temporal, and occipital regions. These results highlight the compensatory neural mechanisms involved in aging during cognitive tasks.
Submitted 27 November, 2023;
originally announced November 2023.
-
Brain Connectivity Features-based Age Group Classification using Temporal Asynchrony Audio-Visual Integration Task
Authors:
Prerna Singh,
Ayush Tripathi,
Lalan Kumar,
Tapan Kumar Gandhi
Abstract:
The process of integrating inputs from several sensory modalities in the human brain is referred to as multisensory integration. Age-related cognitive decline leads to a loss in the ability of the brain to conceive multisensory inputs. Considerable work has been done on such cognitive changes in old age groups. However, for middle age groups, such analysis is limited. Motivated by this, the current work explores EEG-based functional connectivity during an audiovisual temporal asynchrony integration task for middle-aged groups. The investigation covers different tasks: unimodal audio, unimodal visual, and variations of the audio-visual stimulus. A correlation-based functional connectivity analysis is performed, and the changes among different age groups, namely young (18-25 years), transition from young to middle age (25-33 years), and medium (33-41 years), are observed. Furthermore, features extracted from the connectivity graphs have been used to classify among the different age groups. Classification accuracies of 89.4% and 88.4% are obtained for the Audio and Audio-50-Visual stimuli cases with a Random Forest based classifier, thereby validating the efficacy of the proposed method.
Submitted 1 May, 2023; v1 submitted 13 April, 2023;
originally announced April 2023.
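The correlation-based functional connectivity analysis described in the abstract above can be illustrated with a small Python sketch. It computes pairwise Pearson correlations between channel time series, thresholds them into a graph, and derives one simple graph feature. The threshold of 0.5 and the choice of node degree as the feature are assumptions made for illustration; the abstract does not specify which graph features the authors extracted.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def connectivity_matrix(channels):
    """Symmetric matrix of pairwise correlations between channel signals."""
    n = len(channels)
    return [
        [1.0 if i == j else pearson(channels[i], channels[j]) for j in range(n)]
        for i in range(n)
    ]


def node_degrees(conn, threshold=0.5):
    """Count, per channel, the other channels with |correlation| >= threshold.

    ASSUMPTION: both the threshold value and the degree feature are
    illustrative choices, not the paper's.
    """
    n = len(conn)
    return [
        sum(1 for j in range(n) if j != i and abs(conn[i][j]) >= threshold)
        for i in range(n)
    ]


# Hypothetical three-channel recording with four samples per channel.
channels = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, -1.0, -1.0, 1.0],
]
conn = connectivity_matrix(channels)
print(node_degrees(conn))
```

Feature vectors of this kind, one per subject and stimulus condition, are the sort of input a Random Forest classifier could then be trained on.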
-
Analysis of EEG frequency bands for Envisioned Speech Recognition
Authors:
Ayush Tripathi
Abstract:
Automatic speech recognition (ASR) interfaces have become increasingly popular in daily life for interaction with and control of electronic devices. The interfaces currently in use are not feasible for a variety of users, such as those suffering from a speech disorder, locked-in syndrome, or paralysis, or people with utmost privacy requirements. In such cases, an interface that can identify envisioned speech using electroencephalogram (EEG) signals can be of great benefit. Various works targeting this problem have been done in the past. However, there has been limited work in identifying the frequency bands (δ, θ, α, β, γ) of the EEG signal that contribute towards envisioned speech recognition. Therefore, in this work, we aim to analyze the significance of different EEG frequency bands and signals obtained from different lobes of the brain and their contribution towards recognizing envisioned speech. Signals obtained from different lobes and bandpass filtered for different frequency bands are fed to a spatio-temporal deep learning architecture with a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). The performance is evaluated on a publicly available dataset comprising three classification tasks: digits, characters, and images. We obtain classification accuracies of 85.93%, 87.27%, and 87.51% for the three tasks, respectively. The code for the implementation has been made available at https://github.com/ayushayt/ImaginedSpeechRecognition.
Submitted 29 March, 2022;
originally announced March 2022.
-
Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
Authors:
Wei Xia,
Han Lu,
Quan Wang,
Anshuman Tripathi,
Yiling Huang,
Ignacio Lopez Moreno,
Hasim Sak
Abstract:
In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human efforts involved in data collection.
Submitted 25 January, 2022; v1 submitted 23 September, 2021;
originally announced September 2021.
-
Hybrid Transceiver Design for Tera-Hertz MIMO Systems Relying on Bayesian Learning Aided Sparse Channel Estimation
Authors:
Suraj Srivastava,
Ajeet Tripathi,
Neeraj Varshney,
Aditya K. Jagannatham,
Lajos Hanzo
Abstract:
Hybrid transceiver design in multiple-input multiple-output (MIMO) Tera-Hertz (THz) systems relying on sparse channel state information (CSI) estimation techniques is conceived. To begin with, a practical MIMO channel model is developed for the THz band that incorporates its molecular absorption and reflection losses, as well as its non-line-of-sight (NLoS) rays associated with its diffused components. Subsequently, a novel CSI estimation model is derived by exploiting the angular-sparsity of the THz MIMO channel. Then an orthogonal matching pursuit (OMP)-based framework is conceived, followed by designing a sophisticated Bayesian learning (BL)-based approach for efficient estimation of the sparse THz MIMO channel. The Bayesian Cramer-Rao Lower Bound (BCRLB) is also determined for benchmarking the performance of the CSI estimation techniques developed. Finally, an optimal hybrid transmit precoder and receiver combiner pair is designed, which directly relies on the beamspace domain CSI estimates and only requires limited feedback. Finally, simulation results are provided for quantifying the improved mean square error (MSE), spectral-efficiency (SE) and bit-error rate (BER) performance for transmission on practical THz MIMO channel obtained from the HIgh resolution TRANsmission (HITRAN)-database.
Submitted 10 January, 2022; v1 submitted 20 September, 2021;
originally announced September 2021.
-
Physics Driven Domain Specific Transporter Framework with Attention Mechanism for Ultrasound Imaging
Authors:
Arpan Tripathi,
Abhilash Rakkunedeth,
Mahesh Raveendranatha Panicker,
Jack Zhang,
Naveenjyote Boora,
Jessica Knight,
Jacob Jaremko,
Yale Tung Chen,
Kiran Vishnu Narayan,
Kesavadas C
Abstract:
Most applications of deep learning techniques in medical imaging are supervised and require a large amount of labeled data, which is expensive and requires many hours of careful annotation by experts. In this paper, we propose an unsupervised, physics driven domain specific transporter framework with an attention mechanism to identify relevant key points with applications in ultrasound imaging. The proposed framework identifies key points that provide a concise geometric representation highlighting regions with high structural variation in ultrasound videos. We incorporate physics driven domain specific information as a feature probability map and use the Radon transform to highlight features in specific orientations. The proposed framework has been trained on 130 lung ultrasound (LUS) videos and 113 wrist ultrasound (WUS) videos and validated on 100 LUS videos and 58 WUS videos acquired from multiple centers across the globe. Images from both datasets were independently assessed by experts to identify clinically relevant features such as A-lines, B-lines, and pleura from LUS and radial metaphysis, radial epiphysis, and carpal bones from WUS videos. The key points detected from both datasets showed high sensitivity (LUS = 99%, WUS = 74%) in detecting the image landmarks identified by experts. Also, when employed for classification of the given lung image into normal and abnormal classes, the proposed approach, even with no prior training, achieved an average accuracy of 97% and an average F1-score of 95% on the task of co-classification with 3-fold cross-validation. With the purely unsupervised nature of the proposed approach, we expect the key point detection approach to increase the applicability of ultrasound in various examinations performed in emergency and point-of-care settings.
Submitted 13 September, 2021;
originally announced September 2021.
-
Learning the Imaging Landmarks: Unsupervised Key point Detection in Lung Ultrasound Videos
Authors:
Arpan Tripathi,
Mahesh Raveendranatha Panicker,
Abhilash R Hareendranathan,
Yale Tung Chen,
Jacob L Jaremko,
Kiran Vishnu Narayan,
Kesavadas C
Abstract:
Lung ultrasound (LUS) is an increasingly popular diagnostic imaging modality for continuous and periodic monitoring of lung infection, given its advantages of non-invasiveness, non-ionizing nature, portability and easy disinfection. The major landmarks assessed by clinicians for triaging using LUS are pleura, A and B lines. There have been many efforts for the automatic detection of these landmarks. However, restricting to a few pre-defined landmarks may not reveal the actual imaging biomarkers particularly in case of new pathologies like COVID-19. Rather, the identification of key landmarks should be driven by data given the availability of a plethora of neural network algorithms. This work is a first of its kind attempt towards unsupervised detection of the key LUS landmarks in LUS videos of COVID-19 subjects during various stages of infection. We adapted the relatively newer approach of transporter neural networks to automatically mark and track pleura, A and B lines based on their periodic motion and relatively stable appearance in the videos. Initial results on unsupervised pleura detection show an accuracy of 91.8% employing 1081 LUS video frames.
Submitted 13 June, 2021;
originally announced June 2021.
-
Domain Specific Transporter Framework to Detect Fractures in Ultrasound
Authors:
Arpan Tripathi,
Abhilash Rakkunedeth,
Mahesh Raveendranatha Panicker,
Jack Zhang,
Naveenjyote Boora,
Jacob Jaremko
Abstract:
Ultrasound examination for detecting fractures is ideally suited for Emergency Departments (ED) as it is relatively fast, safe (from ionizing radiation), has dynamic imaging capability, and is easily portable. High interobserver variability in manual assessment of ultrasound scans has piqued research interest in automatic assessment techniques using Deep Learning (DL). Most DL techniques are supervised and are trained on large amounts of labeled data, which is expensive and requires many hours of careful annotation by experts. In this paper, we propose an unsupervised, domain specific transporter framework to identify relevant keypoints from wrist ultrasound scans. Our framework provides a concise geometric representation highlighting regions with high structural variation in a 3D ultrasound (3DUS) sequence. We also incorporate domain specific information represented by instantaneous local phase (LP), which detects bone features from 3DUS. We validate the technique on 3DUS videos obtained from 30 subjects. Each ultrasound scan was independently assessed by three readers to identify fractures along with the corresponding x-ray. The saliency of keypoints detected in the image is compared against manual assessment based on distance from relevant features. The transporter neural network was able to accurately detect 180 out of 250 bone regions sampled from wrist ultrasound videos. We expect this technique to increase the applicability of ultrasound in fracture detection.
Submitted 9 June, 2021;
originally announced June 2021.
-
Reducing Streaming ASR Model Delay with Self Alignment
Authors:
Jaeyoung Kim,
Han Lu,
Anshuman Tripathi,
Qian Zhang,
Hasim Sak
Abstract:
Reducing prediction delay for streaming end-to-end ASR models with minimal performance regression is a challenging problem. Constrained alignment is a well-known existing approach that penalizes predicted word boundaries using external low-latency acoustic models. In contrast, the recently proposed FastEmit is a sequence-level delay regularization scheme that encourages vocabulary tokens over blanks without any reference alignments. Although all these schemes are successful in reducing delay, ASR word error rate (WER) often degrades severely after applying them. In this paper, we propose a novel delay constraining method, named self alignment. Self alignment does not require external alignment models. Instead, it utilizes Viterbi forced-alignments from the trained model to find the lower latency alignment direction. In the LibriSpeech evaluation, self alignment outperformed existing schemes, with 25% and 56% less delay compared to FastEmit and constrained alignment at a similar word error rate. For the Voice Search evaluation, 12% and 25% delay reductions were achieved compared to FastEmit and constrained alignment, with more than 2% WER improvements.
Submitted 6 May, 2021;
originally announced May 2021.
-
Automatic Speaker Independent Dysarthric Speech Intelligibility Assessment System
Authors:
Ayush Tripathi,
Swapnil Bhosale,
Sunil Kumar Kopparapu
Abstract:
Dysarthria is a condition that hampers the ability of an individual to control the muscles that play a major role in speech delivery. The loss of fine control over muscles that assist the movement of the lips, vocal cords, tongue, and diaphragm results in abnormal speech delivery. One can assess the severity level of dysarthria by analyzing the intelligibility of speech spoken by an individual. Continuous intelligibility assessment not only helps speech language pathologists study the impact of medication but also allows them to plan personalized therapy. It helps clinicians immensely if the intelligibility assessment system is reliable, automatic, and simple for (a) patients to undergo and (b) clinicians to interpret. The lack of available dysarthric data has resulted in the development of speaker dependent automatic intelligibility assessment systems, which require patients to speak a large number of utterances. In this paper, we propose (a) a cost minimization procedure to select an optimal (small) number of utterances that need to be spoken by the dysarthric patient, (b) four different speaker independent intelligibility assessment systems which require the patient to speak a small number of words, and (c) an assessment score that is close to the perceptual score that the Speech Language Pathologist (SLP) can relate to. The need for only a small number of utterances to be spoken by the patient and the score being relatable to the SLP benefit both the dysarthric patient and the clinician from a usability perspective.
Submitted 10 March, 2021;
originally announced March 2021.
-
Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition
Authors:
Anshuman Tripathi,
Jaeyoung Kim,
Qian Zhang,
Han Lu,
Hasim Sak
Abstract:
In this paper we present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model. The model is composed of a stack of transformer layers for audio encoding with no lookahead or right context and an additional stack of transformer layers on top trained with variable right context. At inference time, the context length for the variable context layers can be changed to trade off the latency and the accuracy of the model. We also show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes. This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy (20% relative improvement for the voice-search task). We show that with limited right context (1-2 seconds of audio) and small additional latency (50-100 milliseconds) at the end of decoding, we can achieve accuracy similar to that of models using unlimited audio right context. We also present optimizations for audio and label encoders to speed up inference in streaming and non-streaming speech decoding.
Submitted 7 October, 2020;
originally announced October 2020.
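The notion of limited left and right context in the transformer layers described above can be illustrated with a generic self-attention mask. The sketch below is an assumption about how such a constraint is commonly expressed, not the authors' implementation; the function name and parameters are hypothetical.

```python
def context_mask(num_frames, left_context, right_context):
    """Boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Frame i sees at most `left_context` past frames and `right_context`
    future frames; right_context=0 gives a fully streaming (no lookahead) layer.
    """
    return [
        [(i - left_context) <= j <= (i + right_context) for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Hypothetical settings: a streaming layer versus a layer with one frame of lookahead.
streaming = context_mask(num_frames=5, left_context=2, right_context=0)
lookahead = context_mask(num_frames=5, left_context=2, right_context=1)
for row in lookahead:
    print("".join("x" if allowed else "." for allowed in row))
```

Increasing `right_context` lets each frame use more future audio, which in this framing corresponds to trading extra latency for accuracy.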
-
Fuzzy Unique Image Transformation: Defense Against Adversarial Attacks On Deep COVID-19 Models
Authors:
Achyut Mani Tripathi,
Ashish Mishra
Abstract:
Early identification of COVID-19 using a deep model trained on chest X-ray and CT images has gained considerable attention from researchers as a way to speed up the identification of active COVID-19 cases. These deep models act as an aid to hospitals that suffer from the unavailability of specialists or radiologists, specifically in remote areas. Various deep models have been proposed to detect COVID-19 cases, but little work has been done to protect the deep models against adversarial attacks capable of fooling the deep model by using a small perturbation in image pixels. This paper presents an evaluation of the performance of deep COVID-19 models against adversarial attacks. It also proposes an efficient yet effective Fuzzy Unique Image Transformation (FUIT) technique that downsamples the image pixels into an interval. The images obtained after the FUIT transformation are further utilized for training a secure deep model that preserves high accuracy in the diagnosis of COVID-19 cases and provides a reliable defense against adversarial attacks. The experiments and results show that the proposed model protects the deep model against six adversarial attacks and maintains high accuracy in classifying COVID-19 cases from the chest X-ray and CT image datasets. The results also suggest that careful inspection is required before practically applying deep models to diagnose COVID-19 cases.
Submitted 8 September, 2020;
originally announced September 2020.
-
Y-net: Biomedical Image Segmentation and Clustering
Authors:
Sharmin Pathan,
Anant Tripathi
Abstract:
We propose a deep clustering architecture alongside image segmentation for medical image analysis. The main idea is based on unsupervised learning to cluster images on severity of the disease in the subject's sample, and this image is then segmented to highlight and outline regions of interest. We start with training an autoencoder on the images for segmentation. The encoder part from the autoenco…
▽ More
We propose a deep clustering architecture alongside image segmentation for medical image analysis. The main idea is based on unsupervised learning to cluster images on severity of the disease in the subject's sample, and this image is then segmented to highlight and outline regions of interest. We start with training an autoencoder on the images for segmentation. The encoder part from the autoencoder branches out to a clustering node and segmentation node. Deep clustering using Kmeans clustering is performed at the clustering branch and a lightweight model is used for segmentation. Each of the branches use extracted features from the autoencoder. We demonstrate our results on ISIC 2018 Skin Lesion Analysis Towards Melanoma Detection and Cityscapes datasets for segmentation and clustering. The proposed architecture beats UNet and DeepLab results on the two datasets, and has less than half the number of parameters. We use the deep clustering branch for clustering images into four clusters. Our approach can be applied to work with high complexity datasets of medical imaging for analyzing survival prediction for severe diseases or customizing treatment based on how far the disease has propagated. Clustering patients can help understand how binning should be done on real valued features to reduce feature sparsity and improve accuracy on classification tasks. The proposed architecture can provide an early diagnosis and reduce human intervention on labeling as it can become quite costly as the datasets grow larger. The main idea is to propose a one shot approach to segmentation with deep clustering.
Submitted 26 May, 2020; v1 submitted 12 April, 2020;
originally announced April 2020.
-
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
Authors:
Qian Zhang,
Han Lu,
Hasim Sak,
Anshuman Tripathi,
Erik McDermott,
Stephen Koo,
Shankar Kumar
Abstract:
In this paper, we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full attention version of our model beats the state-of-the-art accuracy on the LibriSpeech benchmarks. Our results also show that we can bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.
Submitted 14 February, 2020; v1 submitted 6 February, 2020;
originally announced February 2020.
-
Robust Estimation of Hypernasality in Dysarthria with Acoustic Model Likelihood Features
Authors:
Michael Saxon,
Ayush Tripathi,
Yishan Jiao,
Julie Liss,
Visar Berisha
Abstract:
Hypernasality is a common characteristic symptom across many motor-speech disorders. For voiced sounds, hypernasality introduces an additional resonance in the lower frequencies and, for unvoiced sounds, there is reduced articulatory precision due to air escaping through the nasal cavity. However, the acoustic manifestation of these symptoms is highly variable, making hypernasality estimation very challenging, both for human specialists and automated systems. Previous work in this area relies on either engineered features based on statistical signal processing or machine learning models trained on clinical ratings. Engineered features often fail to capture the complex acoustic patterns associated with hypernasality, whereas metrics based on machine learning are prone to overfitting to the small disease-specific speech datasets on which they are trained. Here we propose a new set of acoustic features that capture these complementary dimensions. The features are based on two acoustic models trained on a large corpus of healthy speech. The first acoustic model aims to measure nasal resonance from voiced sounds, whereas the second acoustic model aims to measure articulatory imprecision from unvoiced sounds. To demonstrate that the features derived from these acoustic models are specific to hypernasal speech, we evaluate them across different dysarthria corpora. Our results show that the features generalize even when training on hypernasal speech from one disease and evaluating on hypernasal speech from another disease (e.g. training on Parkinson's disease, evaluation on Huntington's disease), and when training on neurologically disordered speech but evaluating on cleft palate speech.
Submitted 5 August, 2020; v1 submitted 26 November, 2019;
originally announced November 2019.
-
UAV Control in Close Proximities - Ceiling Effect on Battery Lifetime
Authors:
Basaran Bahadir Kocer,
Volkan Kumtepeli,
Tegoeh Tjahjowidodo,
Mahardhika Pratama,
Anshuman Tripathi,
Gerald Seet Gim Lee,
Youyi Wang
Abstract:
With recent developments in unmanned aerial vehicles (UAVs), they are expected to interact and collaborate with surrounding objects, other robots, and people in order to wisely plan and execute particular tasks. Although these interaction operations are inherently challenging compared to free-flight missions, they might bring diverse advantages. One of them is the basic aerodynamic interaction during flight in close proximity to surfaces, which can result in a reduction of the controller effort. In this study, by collecting real-time data, we have observed that the current drawn from the battery can be decreased while flying very close to the surroundings with the help of the ceiling effect. For the first time, this phenomenon is analyzed in terms of battery lifetime degradation by using a simple full equivalent cycle counting method. Results show that the cycling-related effect on battery degradation can be reduced by 15.77% if the UAV can utilize the ceiling effect.
Submitted 31 December, 2018;
originally announced December 2018.
-
Toward domain-invariant speech recognition via large scale training
Authors:
Arun Narayanan,
Ananya Misra,
Khe Chai Sim,
Golan Pundak,
Anshuman Tripathi,
Mohamed Elfeky,
Parisa Haghani,
Trevor Strohman,
Michiel Bacchiani
Abstract:
Current state-of-the-art automatic speech recognition systems are trained to work in specific "domains", defined based on factors like application, sampling rate, and codec. When such recognizers are used in conditions that do not match the training domain, performance drops significantly. This work explores the idea of building a single domain-invariant model for varied use-cases by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: a single model can be robust to multiple application domains, and to variations like codecs and noise. More importantly, such models generalize better to unseen conditions and allow for rapid adaptation: we show that by using as little as 10 hours of data from a new domain, an adapted domain-invariant model can match the performance of a domain-specific model trained from scratch using 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.
Submitted 15 August, 2018;
originally announced August 2018.
-
Adversarial Learning of Raw Speech Features for Domain Invariant Speech Recognition
Authors:
Aditay Tripathi,
Aanchan Mohan,
Saket Anand,
Maneesh Singh
Abstract:
Recent advances in neural network based acoustic modelling have shown significant improvements in automatic speech recognition (ASR) performance. In order for acoustic models to be able to handle large acoustic variability, large amounts of labeled data are necessary, which are often expensive to obtain. This paper explores the application of adversarial training to learn features from raw speech that are invariant to acoustic variability. This acoustic variability is referred to as a domain shift in this paper. The experimental study presented in this paper leverages the architecture of Domain Adversarial Neural Networks (DANNs) [1], which use data from two different domains. The DANN is a Y-shaped network that consists of a multi-layer CNN feature extractor module that is common to a label (senone) classifier and a so-called domain classifier. The utility of DANNs is evaluated on multiple datasets with domain shifts caused by differences in gender and speaker accents. Promising empirical results indicate the strength of adversarial training for unsupervised domain adaptation in ASR, thereby emphasizing the ability of DANNs to learn domain-invariant features from raw speech.
Submitted 21 May, 2018;
originally announced May 2018.
-
Speech recognition for medical conversations
Authors:
Chung-Cheng Chiu,
Anshuman Tripathi,
Katherine Chou,
Chris Co,
Navdeep Jaitly,
Diana Jaunzeikare,
Anjuli Kannan,
Patrick Nguyen,
Hasim Sak,
Ananth Sankar,
Justin Tansuwan,
Nathan Wan,
Yonghui Wu,
Xuedong Zhang
Abstract:
In this work, we explored building automatic speech recognition models for transcribing doctor-patient conversations. We collected a large scale dataset of clinical conversations (14,000 hr), designed the task to represent the real-world scenario, and explored several alignment approaches to iteratively improve data quality. We explored both CTC and LAS systems for building speech recognition models. The LAS was more resilient to noisy data, and CTC required more data clean-up. A detailed analysis is provided for understanding the performance for clinical tasks. Our analysis showed that the speech recognition models performed well on important medical utterances, while errors occurred in casual conversations. Overall, we believe the resulting models can provide reasonable quality in practice.
Submitted 20 June, 2018; v1 submitted 20 November, 2017;
originally announced November 2017.