doi 10.21437/Interspeech.2024-2224

A Multimodal Framework for the Assessment of the Schizophrenia Spectrum

Authors: Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson

Abstract: This paper presents a novel multimodal framework to distinguish between different symptom classes of subjects in the schizophrenia spectrum and healthy controls using audio, video, and text modalities. We implemented Convolution Neural Network and Long Short Term Memory based unimodal models and experimented on various multimodal fusion approaches to come up with the proposed framework. We utilized a minimal Gated multimodal unit (mGMU) to obtain a bi-modal intermediate fusion of the features extracted from the input modalities before finally fusing the outputs of the bimodal fusions to perform subject-wise classifications. The use of mGMU units in the multimodal framework improved the performance in both weighted f1-score and weighted AUC-ROC scores.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted to be presented at Interspeech 2024

arXiv:2406.05947 [pdf, other]

Accent Conversion with Articulatory Representations

Authors: Yashish M. Siriwardena, Nathan Swedlow, Audrey Howard, Evan Gitterman, Dan Darcy, Carol Espy-Wilson, Andrea Fanelli

Abstract: Conversion of non-native accented speech to native (American) English has a wide range of applications such as improving intelligibility of non-native speech. Previous work on this domain has used phonetic posteriograms as the target speech representation to train an acoustic model which is then used to extract a compact representation of input speech for accent conversion. In this work, we introduce the idea of using an effective articulatory speech representation, extracted from an acoustic-to-articulatory speech inversion system, to improve the acoustic model used in accent conversion. The idea to incorporate articulatory representations originates from their ability to well characterize accents in speech. To incorporate articulatory representations with conventional phonetic posteriograms, a multi-task learning based acoustic model is proposed. Objective and subjective evaluations show that the use of articulatory representations can improve the effectiveness of accent conversion.

Submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2309.15136 [pdf, other]

A multi-modal approach for identifying schizophrenia using cross-modal attention

Authors: Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Carol Espy-Wilson

Abstract: This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score.

Submitted 18 April, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

Comments: Accepted to Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2024

arXiv:2309.09220 [pdf, other]

Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

Authors: Ahmed Adel Attia, Yashish M. Siriwardena, Carol Espy-Wilson

Abstract: The performance of deep learning models depends significantly on their capacity to encode input features efficiently and decode them into meaningful outputs. Better input and output representation has the potential to boost models' performance and generalization. In the context of acoustic-to-articulatory speech inversion (SI) systems, we study the impact of utilizing speech representations acquired via self-supervised learning (SSL) models, such as HuBERT compared to conventional acoustic features. Additionally, we investigate the incorporation of novel tract variables (TVs) through an improved geometric transformation model. By combining these two approaches, we improve the Pearson product-moment correlation (PPMC) scores which evaluate the accuracy of TV estimation of the SI system from 0.7452 to 0.8141, a 6.9% increase. Our findings underscore the profound influence of rich feature representations from SSL models and improved geometric transformations with target TVs on the enhanced functionality of SI systems.

Submitted 7 September, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

arXiv:2306.00203 [pdf, ps, other]

Speaker-independent Speech Inversion for Estimation of Nasalance

Authors: Yashish M. Siriwardena, Carol Espy-Wilson, Suzanne Boyce, Mark K. Tiede, Liran Oren

Abstract: The velopharyngeal (VP) valve regulates the opening between the nasal and oral cavities. This valve opens and closes through a coordinated motion of the velum and pharyngeal walls. Nasalance is an objective measure derived from the oral and nasal acoustic signals that correlate with nasality. In this work, we evaluate the degree to which the nasalance measure reflects fine-grained patterns of VP movement by comparison with simultaneously collected direct measures of VP opening using high-speed nasopharyngoscopy (HSN). We show that nasalance is significantly correlated with the HSN signal, and that both match expected patterns of nasality. We then train a temporal convolution-based speech inversion system in a speaker-independent fashion to estimate VP movement for nasality, using nasalance as the ground truth. In further experiments, we also show the importance of incorporating source features (from glottal activity) to improve nasality prediction.

Submitted 31 May, 2023; originally announced June 2023.

Comments: Interspeech 2023

arXiv:2305.16085 [pdf]

doi 10.21437/Interspeech.2023-1924

Acoustic-to-Articulatory Speech Inversion Features for Mispronunciation Detection of /r/ in Child Speech Sound Disorders

Authors: Nina R Benway, Yashish M Siriwardena, Jonathan L Preston, Elaine Hitchcock, Tara McAllister, Carol Espy-Wilson

Abstract: Acoustic-to-articulatory speech inversion could enhance automated clinical mispronunciation detection to provide detailed articulatory feedback unattainable by formant-based mispronunciation detection algorithms; however, it is unclear the extent to which a speech inversion system trained on adult speech performs in the context of (1) child and (2) clinical speech. In the absence of an articulatory dataset in children with rhotic speech sound disorders, we show that classifiers trained on tract variables from acoustic-to-articulatory speech inversion meet or exceed the performance of state-of-the-art features when predicting clinician judgment of rhoticity. Index Terms: rhotic, speech sound disorder, mispronunciation detection

Submitted 25 May, 2023; originally announced May 2023.

Comments: *denotes equal contribution. To appear in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023

Journal ref: Proc. INTERSPEECH 2023, 4568-4572

arXiv:2210.16454 [pdf, ps, other]

Learning to Compute the Articulatory Representations of Speech with the MIRRORNET

Authors: Yashish M. Siriwardena, Carol Espy-Wilson, Shihab Shamma

Abstract: Most organisms including humans function by coordinating and integrating sensory signals with motor actions to survive and accomplish desired tasks. Learning these complex sensorimotor mappings proceeds simultaneously and often in an unsupervised or semi-supervised fashion. An autoencoder architecture (MirrorNet) inspired by this sensorimotor learning paradigm is explored in this work to control an articulatory synthesizer, with minimal exposure to ground-truth articulatory data. The articulatory synthesizer takes as input a set of six vocal Tract Variables (TVs) and source features (voicing indicators and pitch) and is able to synthesize continuous speech for unseen speakers. We show that the MirrorNet, once initialized (with ~30 mins of articulatory data) and further trained in unsupervised fashion (`learning phase'), can learn meaningful articulatory representations with comparable accuracy to articulatory speech-inversion systems trained in a completely supervised fashion.

Submitted 25 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

Comments: Interspeech 2023

Journal ref: Interspeech 2023

arXiv:2210.16450 [pdf, ps, other]

The Secret Source : Incorporating Source Features to Improve Acoustic-to-Articulatory Speech Inversion

Authors: Yashish M. Siriwardena, Carol Espy-Wilson

Abstract: In this work, we incorporated acoustically derived source features, aperiodicity, periodicity and pitch as additional targets to an acoustic-to-articulatory speech inversion (SI) system. We also propose a Temporal Convolution based SI system, which uses auditory spectrograms as the input speech representation, to learn long-range dependencies and complex interactions between the source and vocal tract, to improve the SI task. The experiments are conducted with both the Wisconsin X-ray microbeam (XRMB) and Haskins Production Rate Comparison (HPRC) datasets, with comparisons done with respect to three baseline SI model architectures. The proposed SI system with the HPRC dataset gains an improvement of close to 28% when the source features are used as additional targets. The same SI system outperforms the current best performing SI models by around 9% on the XRMB dataset.

Submitted 28 October, 2022; originally announced October 2022.

arXiv:2205.13755 [pdf, ps, other]

doi 10.21437/Interspeech.2022-11164

Acoustic-to-articulatory Speech Inversion with Multi-task Learning

Authors: Yashish M. Siriwardena, Ganesh Sivaraman, Carol Espy-Wilson

Abstract: Multi-task learning (MTL) frameworks have proven to be effective in diverse speech related tasks like automatic speech recognition (ASR) and speech emotion recognition. This paper proposes a MTL framework to perform acoustic-to-articulatory speech inversion by simultaneously learning an acoustic to phoneme mapping as a shared task. We use the Haskins Production Rate Comparison (HPRC) database which has both the electromagnetic articulography (EMA) data and the corresponding phonetic transcriptions. Performance of the system was measured by computing the correlation between estimated and actual tract variables (TVs) from the acoustic to articulatory speech inversion task. The proposed MTL based Bidirectional Gated Recurrent Neural Network (RNN) model learns to map the input acoustic features to nine TVs while outperforming the baseline model trained to perform only acoustic to articulatory inversion.

Submitted 26 May, 2022; originally announced May 2022.

Journal ref: Proc. Interspeech 2022

arXiv:2205.13086 [pdf, ps, other]

Audio Data Augmentation for Acoustic-to-articulatory Speech Inversion using Bidirectional Gated RNNs

Authors: Yashish M. Siriwardena, Ahmed Adel Attia, Ganesh Sivaraman, Carol Espy-Wilson

Abstract: Data augmentation has proven to be a promising prospect in improving the performance of deep learning models by adding variability to training data. In previous work with developing a noise robust acoustic-to-articulatory speech inversion system, we have shown the importance of noise augmentation to improve the performance of speech inversion in noisy speech. In this work, we compare and contrast different ways of doing data augmentation and show how this technique improves the performance of articulatory speech inversion not only on noisy speech, but also on clean speech data. We also propose a Bidirectional Gated Recurrent Neural Network as the speech inversion system instead of the previously used feed forward neural network. The inversion system uses mel-frequency cepstral coefficients (MFCCs) as the input acoustic features and six vocal tract-variables (TVs) as the output articulatory features. The Performance of the system was measured by computing the correlation between estimated and actual TVs on the U. Wisc. X-ray Microbeam database. The proposed speech inversion system shows a 5% relative improvement in correlation over the baseline noise robust system for clean speech data. The pre-trained model, when adapted to each unseen speaker in the test set, improves the average correlation by another 6%.

Submitted 31 May, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: EUSIPCO 2023

arXiv:2110.05695 [pdf, ps, other]

doi 10.1109/ICASSP43922.2022.9747358

The Mirrornet : Learning Audio Synthesizer Controls Inspired by Sensorimotor Interaction

Authors: Yashish M. Siriwardena, Guilhem Marion, Shihab Shamma

Abstract: Experiments to understand the sensorimotor neural interactions in the human cortical speech system support the existence of a bidirectional flow of interactions between the auditory and motor regions. Their key function is to enable the brain to `learn' how to control the vocal tract for speech production. This idea is the impetus for the recently proposed "MirrorNet", a constrained autoencoder architecture. In this paper, the MirrorNet is applied to learn, in an unsupervised manner, the controls of a specific audio synthesizer (DIVA) to produce melodies only from their auditory spectrograms. The results demonstrate how the MirrorNet discovers the synthesizer parameters to generate the melodies that closely resemble the original and those of unseen melodies, and even determine the best set parameters to approximate renditions of complex piano melodies generated by a different synthesizer. This generalizability of the MirrorNet illustrates its potential to discover from sensory data the controls of arbitrary motor-plants.

Submitted 18 February, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

Journal ref: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2110.04440 [pdf, other]

doi 10.1145/3462244.3479967

Multimodal Approach for Assessing Neuromotor Coordination in Schizophrenia Using Convolutional Neural Networks

Authors: Yashish M. Siriwardena, Chris Kitchen, Deanna L. Kelly, Carol Espy-Wilson

Abstract: This study investigates the speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g. hallucinations and delusions), using two distinct channel-delay correlation methods. We show that the schizophrenic subjects with strong positive symptoms and who are markedly ill pose complex articulatory coordination pattern in facial and speech gestures than what is observed in healthy subjects. This distinction in speech coordination pattern is used to train a multimodal convolutional neural network (CNN) which uses video and audio data during speech to distinguish schizophrenic patients with strong positive symptoms from healthy subjects. We also show that the vocal tract variables (TVs) which correspond to place of articulation and glottal source outperform the Mel-frequency Cepstral Coefficients (MFCCs) when fused with Facial Action Units (FAUs) in the proposed multimodal network. For the clinical dataset we collected, our best performing multimodal network improves the mean F1 score for detecting schizophrenia by around 18% with respect to the full vocal tract coordination (FVTC) baseline method implemented with fusing FAUs and MFCCs.

Submitted 8 October, 2021; originally announced October 2021.

Comments: 5 pages. arXiv admin note: text overlap with arXiv:2102.07054

Journal ref: Proceedings of the 2021 International Conference on Multimodal Interaction

Showing 1–12 of 12 results for author: Siriwardena, Y M