Search | arXiv e-print repository

Integrating Posture Control in Speech Motor Models: A Parallel-Structured Simulation Approach

Authors: Yadong Liu, Sidney Fels, Arian Shamei, Najeeb Khan, Bryan Gick

Abstract: Posture is an essential aspect of motor behavior, necessitating continuous muscle activation to counteract gravity. It remains stable under perturbation, aiding in maintaining bodily balance and enabling movement execution. Similarities have been observed between gross body postures and speech postures, such as those involving the jaw, tongue, and lips, which also exhibit resilience to perturbations and assist in equilibrium and movement. Although postural control is a recognized element of human movement and balance, particularly in broader motor skills, it has not been adequately incorporated into existing speech motor control models, which typically concentrate on the gestures or motor commands associated with specific speech movements, overlooking the influence of postural control and gravity. Here we introduce a model that aligns speech posture and movement, using simulations to explore whether speech posture within this framework mirrors the principles of bodily postural control. Our findings indicate that, akin to body posture, speech posture is also robust to perturbation and plays a significant role in maintaining local segment balance and enhancing speech production.

Submitted 26 July, 2024; originally announced July 2024.

Comments: 11 pages, 3 figures

arXiv:2309.14586 [pdf, other]

Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer

Authors: Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Sidney Fels, Jerry L. Prince, Georges El Fakhri, Jonghye Woo

Abstract: The tongue's intricate 3D structure, comprising localized functional units, plays a crucial role in the production of speech. When measured using tagged MRI, these functional units exhibit cohesive displacements and derived quantities that facilitate the complex process of speech production. Non-negative matrix factorization-based approaches have been shown to estimate the functional units through motion features, yielding a set of building blocks and a corresponding weighting map. Investigating the link between weighting maps and speech acoustics can offer significant insights into the intricate process of speech production. To this end, in this work, we utilize two-dimensional spectrograms as a proxy representation, and develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms. Our proposed plastic light transformer (PLT) framework is based on directional product relative position bias and single-level spatial pyramid pooling, thus enabling flexible processing of weighting maps with variable size to fixed-size spectrograms, without input information loss or dimension expansion. Additionally, our PLT framework efficiently models the global correlation of wide matrix input. To improve the realism of our generated spectrograms with relatively limited training samples, we apply pair-wise utterance consistency with Maximum Mean Discrepancy constraint and adversarial training. Experimental results on a dataset of 29 subjects speaking two utterances demonstrated that our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.

Submitted 25 September, 2023; originally announced September 2023.

Comments: MICCAI 2023 (Oral presentation)

arXiv:2102.04588 [pdf, other]

A comparative study of two-dimensional vocal tract acoustic modeling based on Finite-Difference Time-Domain methods

Authors: Debasish Ray Mohapatra, Victor Zappi, Sidney Fels

Abstract: The two-dimensional (2D) numerical approaches for vocal tract (VT) modelling can afford a better balance between the low computational cost and accurate rendering of acoustic wave propagation. However, they require a high spatio-temporal resolution in the numerical scheme for a precise estimation of acoustic formants at the simulation run-time expense. We have recently proposed a new VT acoustic modelling technique, known as the 2.5D Finite-Difference Time-Domain (2.5D FDTD), which extends the existing 2D FDTD approach by adding tube depth to its acoustic wave solver. In this work, first, the simulated acoustic outputs of our new model are shown to be comparable with the 2D FDTD and a realistic 3D FEM VT model at a low spatio-temporal resolution. Next, a radiation model is developed by including a circular baffle around the VT as head geometry. The transfer functions of the radiation model are analyzed using five different vocal tract shapes for vowel sounds /a/, /e/, /i/, /o/ and /u/.

Submitted 8 February, 2021; originally announced February 2021.

Comments: 4 pages, 3 figures

arXiv:2102.01640 [pdf, other]

SPEAK WITH YOUR HANDS Using Continuous Hand Gestures to control Articulatory Speech Synthesizer

Authors: Pramit Saha, Debasish Ray Mohapatra, Sidney Fels

Abstract: This work presents our advancements in controlling an articulatory speech synthesis engine, viz., Pink Trombone, with hand gestures. Our interface translates continuous finger movements and wrist flexion into continuous speech using vocal tract area-function based articulatory speech synthesis. We use Cyberglove II with 18 sensors to capture the kinematic information of the wrist and the individual fingers, in order to control a virtual tongue. The coordinates and the bending values of the sensors are then utilized to fit a spline tongue model that smoothens out the noisy values and outliers. Considering the upper palate as fixed and the spline model as the dynamically moving lower surface (tongue) of the vocal tract, we compute 1D area functional values that are fed to the Pink Trombone, generating continuous speech sounds. Therefore, by learning to manipulate one's wrist and fingers, one can learn to produce speech sounds just through one's hands, without the need for using the vocal tract.

Submitted 2 February, 2021; originally announced February 2021.

Comments: 2 pages, 1 figure

arXiv:2010.14228 [pdf]

doi 10.1145/634067.634348

New interfaces for musical expression

Authors: Ivan Poupyrev, Michael J. Lyons, Sidney Fels, Tina Blaine

Abstract: The rapid evolution of electronics, digital media, advanced materials, and other areas of technology, is opening up unprecedented opportunities for musical interface inventors and designers. The possibilities afforded by these new technologies carry with them the challenges of a complex and often confusing array of choices for musical composers and performers. New musical technologies are at least partly responsible for the current explosion of new musical forms, some of which are controversial and challenge traditional definitions of music. Alternative musical controllers, currently the leading edge of the ongoing dialogue between technology and musical culture, involve many of the issues covered at past CHI meetings. This workshop brings together interface experts interested in musical controllers and musicians and composers involved in the development of new musical interfaces.

Submitted 27 October, 2020; originally announced October 2020.

Comments: 2 pages, This item describes the CHI'01 workshop which started the International Conference on New Interfaces for Musical Expression

ACM Class: H.5.5

Journal ref: ACM CHI'01 Extended Abstracts on Human Factors in Computing Systems, March 2001 Pages 491-492

arXiv:2006.16367 [pdf, other]

Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images

Authors: Pramit Saha, Yadong Liu, Bryan Gick, Sidney Fels

Abstract: Thousands of individuals need surgical removal of their larynx due to critical diseases every year and therefore, require an alternative form of communication to articulate speech sounds after the loss of their voice box. This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images for the development of a silent-speech interface (SSI) that can provide them with an assistance in their daily interactions. Our approach targets automatically extracting tongue movement information by selecting an optimal feature set from US images and mapping these features to the acoustic space. We use a novel deep learning architecture to map US tongue images from the US probe placed beneath a subject's chin to formants that we call, Ultrasound2Formant (U2F) Net. It uses hybrid spatio-temporal 3D convolutions followed by feature shuffling, for the estimation and tracking of vowel formants from US images. The formant values are then utilized to synthesize continuous time-varying vowel trajectories, via Klatt Synthesizer. Our best model achieves R-squared (R^2) measure of 99.96% for the regression task. Our network lays the foundation for an SSI as it successfully tracks the tongue contour automatically as an internal representation without any explicit annotation.

Submitted 29 June, 2020; originally announced June 2020.

Comments: Accepted for publication in MICCAI 2020

arXiv:2005.09463 [pdf, other]

Learning Joint Articulatory-Acoustic Representations with Normalizing Flows

Authors: Pramit Saha, Sidney Fels

Abstract: The articulatory geometric configurations of the vocal tract and the acoustic properties of the resultant speech sound are considered to have a strong causal relationship. This paper aims at finding a joint latent representation between the articulatory and acoustic domain for vowel sounds via invertible neural network models, while simultaneously preserving the respective domain-specific features. Our model utilizes a convolutional autoencoder architecture and normalizing flow-based models to allow both forward and inverse mappings in a semi-supervised manner, between the mid-sagittal vocal tract geometry of a two degrees-of-freedom articulatory synthesizer with 1D acoustic wave model and the Mel-spectrogram representation of the synthesized speech sounds. Our approach achieves satisfactory performance in achieving both articulatory-to-acoustic as well as acoustic-to-articulatory mapping, thereby demonstrating our success in achieving a joint encoding of both the domains.

Submitted 30 September, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

Comments: 5 pages, 4 figures, accepted for publication in Interspeech 2020

arXiv:1912.03120 [pdf, other]

A Study into Echocardiography View Conversion

Authors: Amir H. Abdi, Mohammad H. Jafari, Sidney Fels, Theresa Tsang, Purang Abolmaesumi

Abstract: Transthoracic echo is one of the most common means of cardiac studies in the clinical routines. During the echo exam, the sonographer captures a set of standard cross sections (echo views) of the heart. Each 2D echo view cuts through the 3D cardiac geometry via a unique plane. Consequently, different views share some limited information. In this work, we investigate the feasibility of generating a 2D echo view using another view based on adversarial generative models. The objective optimized to train the view-conversion model is based on the ideas introduced by LSGAN, PatchGAN and Conditional GAN (cGAN). The size and length of the left ventricle in the generated target echo view is compared against that of the target ground-truth to assess the validity of the echo view conversion. Results show that there is a correlation of 0.50 between the LV areas and 0.49 between the LV lengths of the generated target frames and the real target frames.

Submitted 5 December, 2019; originally announced December 2019.

Comments: Workshop of Medical Imaging Meets NeurIPS, NeurIPS 2019

arXiv:1910.13859 [pdf, other]

Reinforcement Learning for High-dimensional Continuous Control in Biomechanics: An Intro to ArtiSynth-RL

Authors: Amir H. Abdi, Masoud Malakoutian, Thomas Oxland, Sidney Fels

Abstract: Neural control is an exciting mystery which we instinctively master. Yet, researchers have a hard time explaining the motor control trajectories. Physiologically accurate biomechanical simulations can, to some extent, mimic live subjects and help us form evidence-based hypotheses. In these simulated environments, muscle excitations are typically calculated through inverse dynamic optimizations which do not possess a closed-form solution. Thus, computationally expensive, and occasionally unstable, iterative numerical solvers are the only widely utilized solution. In this work, we introduce ArtiSynth, a 3D modeling platform that supports the combined simulation of multi-body and finite element models, and extended to support reinforcement learning (RL) training. We further use ArtiSynth to investigate whether a deep RL policy can be trained to drive the motor control of a physiologically accurate biomechanical model in a large continuous action space. We run a comprehensive evaluation of its performance and compare the results with the forward dynamics assisted tracking with a quadratic objective function. We assess the two approaches in terms of correctness, stability, energy-efficiency, and temporal consistency.

Submitted 9 December, 2019; v1 submitted 25 October, 2019; originally announced October 2019.

Comments: Deep Reinforcement Learning Workshop NeurIPS 2019

arXiv:1909.09585 [pdf, other]

doi 10.21437/Interspeech.2019-1764

An extended two-dimensional vocal tract model for fast acoustic simulation of single-axis symmetric three-dimensional tubes

Authors: Debasish Ray Mohapatra, Victor Zappi, Sidney Fels

Abstract: The simulation of two-dimensional (2D) wave propagation is an affordable computational task and its use can potentially improve time performance in vocal tracts' acoustic analysis. Several models have been designed that rely on 2D wave solvers and include 2D representations of three-dimensional (3D) vocal tract-like geometries. However, until now, only the acoustics of straight 3D tubes with circular cross-sections have been successfully replicated with this approach. Furthermore, the simulation of the resulting 2D shapes requires extremely high spatio-temporal resolutions, dramatically reducing the speed boost deriving from the usage of a 2D wave solver. In this paper, we introduce an in-progress novel vocal tract model that extends the 2D Finite-Difference Time-Domain wave solver (2.5D FDTD) by adding tube depth, derived from the area functions, to the acoustic solver. The model combines the speed of a light 2D numerical scheme with the ability to natively simulate 3D tubes that are symmetric in one dimension, hence relaxing previous resolution requirements. An implementation of the 2.5D FDTD is presented, along with evaluation of its performance in the case of static vowel modeling. The paper discusses the current features and limits of the approach, and the potential impact on computational acoustics applications.

Submitted 18 September, 2019; originally announced September 2019.

Comments: 5 pages, 2 figures, Interspeech 2019 submission

arXiv:1904.05746 [pdf, other]

SPEAK YOUR MIND! Towards Imagined Speech Recognition With Hierarchical Deep Learning

Authors: Pramit Saha, Muhammad Abdul-Mageed, Sidney Fels

Abstract: Speech-related Brain Computer Interface (BCI) technologies provide effective vocal communication strategies for controlling devices through speech commands interpreted from brain signals. In order to infer imagined speech from active thoughts, we propose a novel hierarchical deep learning BCI system for subject-independent classification of 11 speech tokens including phonemes and words. Our novel approach exploits predicted articulatory information of six phonological categories (e.g., nasal, bilabial) as an intermediate step for classifying the phonemes and words, thereby finding discriminative signal responsible for natural speech synthesis. The proposed network is composed of hierarchical combination of spatial and temporal CNN cascaded with a deep autoencoder. Our best models on the KARA database achieve an average accuracy of 83.42% across the six different binary phonological classification tasks, and 53.36% for the individual token identification task, significantly outperforming our baselines. Ultimately, our work suggests the possible existence of a brain imagery footprint for the underlying articulatory movement related to different sounds that can be used to aid imagined speech decoding.

Submitted 8 April, 2019; originally announced April 2019.

Comments: Under review in INTERSPEECH 2019. arXiv admin note: text overlap with arXiv:1904.04358

arXiv:1904.04358 [pdf, other]

Deep Learning the EEG Manifold for Phonological Categorization from Active Thoughts

Authors: Pramit Saha, Muhammad Abdul-Mageed, Sidney Fels

Abstract: Speech-related Brain Computer Interfaces (BCI) aim primarily at finding an alternative vocal communication pathway for people with speaking disabilities. As a step towards full decoding of imagined speech from active thoughts, we present a BCI system for subject-independent classification of phonological categories exploiting a novel deep learning based hierarchical feature extraction scheme. To better capture the complex representation of high-dimensional electroencephalography (EEG) data, we compute the joint variability of EEG electrodes into a channel cross-covariance matrix. We then extract the spatio-temporal information encoded within the matrix using a mixed deep neural network strategy. Our model framework is composed of a convolutional neural network (CNN), a long-short term network (LSTM), and a deep autoencoder. We train the individual networks hierarchically, feeding their combined outputs in a final gradient boosting classification step. Our best models achieve an average accuracy of 77.9% across five different binary classification tasks, providing a significant 22.5% improvement over previous methods. As we also show visually, our work demonstrates that the speech imagery EEG possesses significant discriminative information about the intended articulatory movements responsible for natural speech synthesis.

Submitted 8 April, 2019; originally announced April 2019.

Comments: Accepted for publication in IEEE ICASSP 2019

arXiv:1904.04352 [pdf, other]

Hierarchical Deep Feature Learning For Decoding Imagined Speech From EEG

Authors: Pramit Saha, Sidney Fels

Abstract: We propose a mixed deep neural network strategy, incorporating parallel combination of Convolutional (CNN) and Recurrent Neural Networks (RNN), cascaded with deep autoencoders and fully connected layers towards automatic identification of imagined speech from EEG. Instead of utilizing raw EEG channel data, we compute the joint variability of the channels in the form of a covariance matrix that provides spatio-temporal representations of EEG. The networks are trained hierarchically and the extracted features are passed onto the next network hierarchy until the final classification. Using a publicly available EEG based speech imagery database we demonstrate around 23.45% improvement of accuracy over the baseline method. Our approach demonstrates the promise of a mixed DNN approach for complex spatial-temporal classification problems.

Submitted 8 April, 2019; originally announced April 2019.

Comments: Accepted in AAAI 2019 under Student Abstract and Poster Program

arXiv:1811.08029 [pdf, other]

Sound-Stream II: Towards Real-Time Gesture Controlled Articulatory Sound Synthesis

Authors: Pramit Saha, Debasish Ray Mohapatra, Praneeth SV, Sidney Fels

Abstract: We present an interface involving four degrees-of-freedom (DOF) mechanical control of a two dimensional, mid-sagittal tongue through a biomechanical toolkit called ArtiSynth and a sound synthesis engine called JASS towards articulatory sound synthesis. As a demonstration of the project, the user will learn to produce a range of JASS vocal sounds, by varying the shape and position of the ArtiSynth tongue in 2D space through a set of four force-based sensors. In other words, the user will be able to physically play around with these four sensors, thereby virtually controlling the magnitude of four selected muscle excitations of the tongue to vary articulatory structure. This variation is computed in terms of Area Functions in ArtiSynth environment and communicated to the JASS based audio-synthesizer coupled with two-mass glottal excitation model to complete this end-to-end gesture-to-sound mapping.

Submitted 19 November, 2018; originally announced November 2018.

arXiv:1811.07435 [pdf, other]

doi 10.1121/1.5068357

Limitations of Source-Filter Coupling In Phonation

Authors: Debasish Ray Mohapatra, Sidney Fels

Abstract: The coupling of vocal fold (source) and vocal tract (filter) is one of the most critical factors in source-filter articulation theory. The traditional linear source-filter theory has been challenged by current research which clearly shows the impact of acoustic loading on the dynamic behavior of the vocal fold vibration as well as the variations in the glottal flow pulses shape. This paper outlines the underlying mechanism of source-filter interactions; demonstrates the design and working principles of coupling for the various existing vocal cord and vocal tract biomechanical models. For our study, we have considered self-oscillating lumped-element models of the acoustic source and computational models of the vocal tract as articulators. To understand the limitations of source-filter interactions which are associated with each of those models, we compare them concerning their mechanical design, acoustic and physiological characteristics and aerodynamic simulation.

Submitted 18 November, 2018; originally announced November 2018.

Comments: 2 pages, 2 figures

arXiv:1807.11089 [pdf, other]

Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI

Authors: Pramit Saha, Praneeth Srungarapu, Sidney Fels

Abstract: Vocal tract configurations play a vital role in generating distinguishable speech sounds, by modulating the airflow and creating different resonant cavities in speech production. They contain abundant information that can be utilized to better understand the underlying speech production mechanism. As a step towards automatic mapping of vocal tract shape geometry to acoustics, this paper employs effective video action recognition techniques, like Long-term Recurrent Convolutional Networks (LRCN) models, to identify different vowel-consonant-vowel (VCV) sequences from dynamic shaping of the vocal tract. Such a model typically combines a CNN based deep hierarchical visual feature extractor with Recurrent Networks, that ideally makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performances of this class of algorithms under various parameter settings and for various classification tasks are discussed. Interestingly, the results show a marked difference in the model performance in the context of speech classification with respect to generic sequence or video classification tasks.

Submitted 29 July, 2018; originally announced July 2018.

Comments: To appear in the INTERSPEECH 2018 Proceedings

Showing 1–16 of 16 results for author: Fels, S