-
LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
Authors:
Anthony Fuller,
Yousef Yassin,
Junfeng Wen,
Daniel G. Kyrollos,
Tarek Ibrahim,
James R. Green,
Evan Shelhamer
Abstract:
Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution ext…
▽ More
Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Simpler Fast Vision Transformers with a Jumbo CLS Token
Authors:
Anthony Fuller,
Yousef Yassin,
Daniel G. Kyrollos,
Evan Shelhamer,
James R. Green
Abstract:
We introduce a simple enhancement of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Since there is only one Jumbo token, its cost is minimal,…
▽ More
We introduce a simple enhancement of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Since there is only one Jumbo token, its cost is minimal, and because we share this FFN across layers, its parameter count is controlled. Jumbo significantly improves over ViT+Registers on ImageNet-1K and ImageNet-21K. These gains are largest at small sizes / high speeds, e.g., ViT-nano+Jumbo outperforms ViT-nano+Registers by 13%. In fact, our Jumbo models are so efficient that they outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs, such as support for token dropping and other modalities. Accordingly, we demonstrate that Jumbo excels in these two settings via masked autoencoding and on a suite of time series benchmarks. Code and weights available: https://github.com/antofuller/jumbo
△ Less
Submitted 23 May, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Galileo: Learning Global & Local Features of Many Remote Sensing Modalities
Authors:
Gabriel Tseng,
Anthony Fuller,
Marlena Reil,
Henry Herzog,
Patrick Beukema,
Favyen Bastani,
James R. Green,
Evan Shelhamer,
Hannah Kerner,
David Rolnick
Abstract:
We introduce a highly multimodal transformer to represent many remote sensing modalities - multispectral optical, synthetic aperture radar, elevation, weather, pseudo-labels, and more - across space and time. These inputs are useful for diverse remote sensing tasks, such as crop mapping and flood detection. However, learning shared representations of remote sensing data is challenging, given the d…
▽ More
We introduce a highly multimodal transformer to represent many remote sensing modalities - multispectral optical, synthetic aperture radar, elevation, weather, pseudo-labels, and more - across space and time. These inputs are useful for diverse remote sensing tasks, such as crop mapping and flood detection. However, learning shared representations of remote sensing data is challenging, given the diversity of relevant data modalities, and because objects of interest vary massively in scale, from small boats (1-2 pixels and fast) to glaciers (thousands of pixels and slow). We present a novel self-supervised learning algorithm that extracts multi-scale features across a flexible set of input modalities through masked modeling. Our dual global and local contrastive losses differ in their targets (deep representations vs. shallow input projections) and masking strategies (structured vs. not). Our Galileo is a single generalist model that outperforms SoTA specialist models for satellite images and pixel time series across eleven benchmarks and multiple tasks.
△ Less
Submitted 4 June, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech
Authors:
Pan-Pan Jiang,
Jimmy Tobin,
Katrin Tomanek,
Robert L. MacDonald,
Katie Seaver,
Richard Cave,
Marilyn Ladewig,
Rus Heywood,
Jordan R. Green
Abstract:
Project Euphonia, a Google initiative, is dedicated to improving automatic speech recognition (ASR) of disordered speech. A central objective of the project is to create a large, high-quality, and diverse speech corpus. This report describes the project's latest advancements in data collection and annotation methodologies, such as expanding speaker diversity in the database, adding human-reviewed…
▽ More
Project Euphonia, a Google initiative, is dedicated to improving automatic speech recognition (ASR) of disordered speech. A central objective of the project is to create a large, high-quality, and diverse speech corpus. This report describes the project's latest advancements in data collection and annotation methodologies, such as expanding speaker diversity in the database, adding human-reviewed transcript corrections and audio quality tags to 350K (of the 1.2M total) audio recordings, and amassing a comprehensive set of metadata (including more than 40 speech characteristic labels) for over 75\% of the speakers in the database. We report on the impact of transcript corrections on our machine-learning (ML) research, inter-rater variability of assessments of disordered speech patterns, and our rationale for gathering speech metadata. We also consider the limitations of using automated off-the-shelf annotation methods for assessing disordered speech.
△ Less
Submitted 13 September, 2024;
originally announced September 2024.
-
LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
Authors:
Anthony Fuller,
Daniel G. Kyrollos,
Yousef Yassin,
James R. Green
Abstract:
High-resolution images offer more information about scenes that can improve model accuracy. However, the dominant model architecture in computer vision, the vision transformer (ViT), cannot effectively leverage larger images without finetuning -- ViTs poorly extrapolate to more patches at test time, although transformers offer sequence length flexibility. We attribute this shortcoming to the curre…
▽ More
High-resolution images offer more information about scenes that can improve model accuracy. However, the dominant model architecture in computer vision, the vision transformer (ViT), cannot effectively leverage larger images without finetuning -- ViTs poorly extrapolate to more patches at test time, although transformers offer sequence length flexibility. We attribute this shortcoming to the current patch position encoding methods, which create a distribution shift when extrapolating.
We propose a drop-in replacement for the position encoding of plain ViTs that restricts attention heads to fixed fields of view, pointed in different directions, using 2D attention masks. Our novel method, called LookHere, provides translation-equivariance, ensures attention head diversity, and limits the distribution shift that attention heads face when extrapolating. We demonstrate that LookHere improves performance on classification (avg. 1.6%), against adversarial attack (avg. 5.4%), and decreases calibration error (avg. 1.5%) -- on ImageNet without extrapolation. With extrapolation, LookHere outperforms the current SoTA position encoding method, 2D-RoPE, by 21.7% on ImageNet when trained at $224^2$ px and tested at $1024^2$ px. Additionally, we release a high-resolution test set to improve the evaluation of high-resolution image classifiers, called ImageNet-HR.
△ Less
Submitted 29 October, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders
Authors:
Anthony Fuller,
Koreen Millard,
James R. Green
Abstract:
A vital and rapidly growing application, remote sensing offers vast yet sparsely labeled, spatially aligned multimodal data; this makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral opti…
▽ More
A vital and rapidly growing application, remote sensing offers vast yet sparsely labeled, spatially aligned multimodal data; this makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral optical and synthetic aperture radar samples -- aligned in space and time -- and performs cross-modal contrastive learning. Another encoder fuses these sensors, producing joint multimodal encodings that are used to predict the masked patches via a lightweight decoder. We show that these objectives are complementary when leveraged on spatially aligned multimodal data. We also introduce X- and 2D-ALiBi, which spatially biases our cross- and self-attention matrices. These strategies improve representations and allow our models to effectively extrapolate to images up to 17.6x larger at test-time. CROMA outperforms the current SoTA multispectral model, evaluated on: four classification benchmarks -- finetuning (avg. 1.8%), linear (avg. 2.4%) and nonlinear (avg. 1.4%) probing, kNN classification (avg. 3.5%), and K-means clustering (avg. 8.4%); and three segmentation benchmarks (avg. 6.4%). CROMA's rich, optionally multimodal representations can be widely leveraged across remote sensing applications.
△ Less
Submitted 1 November, 2023;
originally announced November 2023.
-
An analysis of degenerating speech due to progressive dysarthria on ASR performance
Authors:
Katrin Tomanek,
Katie Seaver,
Pan-Pan Jiang,
Richard Cave,
Lauren Harrel,
Jordan R. Green
Abstract:
Although personalized automatic speech recognition (ASR) models have recently been designed to recognize even severely impaired speech, model performance may degrade over time for persons with degenerating speech. The aims of this study were to (1) analyze the change of performance of ASR over time in individuals with degrading speech, and (2) explore mitigation strategies to optimize recognition…
▽ More
Although personalized automatic speech recognition (ASR) models have recently been designed to recognize even severely impaired speech, model performance may degrade over time for persons with degenerating speech. The aims of this study were to (1) analyze the change of performance of ASR over time in individuals with degrading speech, and (2) explore mitigation strategies to optimize recognition throughout disease progression. Speech was recorded by four individuals with degrading speech due to amyotrophic lateral sclerosis (ALS). Word error rates (WER) across recording sessions were computed for three ASR models: Unadapted Speaker Independent (U-SI), Adapted Speaker Independent (A-SI), and Adapted Speaker Dependent (A-SD or personalized). The performance of all three models degraded significantly over time as speech became more impaired, but the performance of the A-SD model improved markedly when it was updated with recordings from the severe stages of speech progression. Recording additional utterances early in the disease before speech degraded significantly did not improve the performance of A-SD models. Overall, our findings emphasize the importance of continuous recording (and model retraining) when providing personalized models for individuals with progressive speech impairments.
△ Less
Submitted 31 October, 2022;
originally announced November 2022.
-
Generalized Reciprocal Perspective
Authors:
Kevin Dick,
Daniel G. Kyrollos,
James R. Green
Abstract:
Across many domains, real-world problems can be represented as a network. Nodes represent domain-specific elements and edges capture the relationship between elements. Leveraging high-performance computing and optimized link prediction algorithms, it is increasingly possible to evaluate every possible combination of nodal pairs enabling the generation of a comprehensive prediction matrix (CPM) tha…
▽ More
Across many domains, real-world problems can be represented as a network. Nodes represent domain-specific elements and edges capture the relationship between elements. Leveraging high-performance computing and optimized link prediction algorithms, it is increasingly possible to evaluate every possible combination of nodal pairs enabling the generation of a comprehensive prediction matrix (CPM) that places an individual link prediction score in the context of all possible links involving either node (providing data-driven context). Historically, this contextual information has been ignored given exponentially growing problem sizes resulting in computational intractability; however, we demonstrate that expending high-performance compute resources to generate CPMs is a worthwhile investment given the improvement in predictive performance. In this work, we generalize for all pairwise link-prediction tasks our novel semi-supervised machine learning method, denoted Reciprocal Perspective (RP). We demonstrate that RP significantly improves link prediction accuracy by leveraging the wealth of information in a CPM. Context-based features are extracted from the CPM for use in a stacked classifier and we demonstrate that the application of RP in a cascade almost always results in significantly (p < 0.05) improved predictions. These results on RS-type problems suggest that RP is applicable to a broad range of link prediction problems.
△ Less
Submitted 20 October, 2022;
originally announced October 2022.
-
Under the Cover Infant Pose Estimation using Multimodal Data
Authors:
Daniel G. Kyrollos,
Anthony Fuller,
Kim Greenwood,
JoAnn Harrold,
James R. Green
Abstract:
Infant pose monitoring during sleep has multiple applications in both healthcare and home settings. In a healthcare setting, pose detection can be used for region of interest detection and movement detection for noncontact based monitoring systems. In a home setting, pose detection can be used to detect sleep positions which has shown to have a strong influence on multiple health factors. However,…
▽ More
Infant pose monitoring during sleep has multiple applications in both healthcare and home settings. In a healthcare setting, pose detection can be used for region of interest detection and movement detection for noncontact based monitoring systems. In a home setting, pose detection can be used to detect sleep positions which has shown to have a strong influence on multiple health factors. However, pose monitoring during sleep is challenging due to heavy occlusions from blanket coverings and low lighting. To address this, we present a novel dataset, Simultaneously-collected multimodal Mannequin Lying pose (SMaL) dataset, for under the cover infant pose estimation. We collect depth and pressure imagery of an infant mannequin in different poses under various cover conditions. We successfully infer full body pose under the cover by training state-of-art pose estimation methods and leveraging existing multimodal adult pose datasets for transfer learning. We demonstrate a hierarchical pretraining strategy for transformer-based models to significantly improve performance on our dataset. Our best performing model was able to detect joints under the cover within 25mm 86% of the time with an overall mean error of 16.9mm. Data, code and models publicly available at https://github.com/DanielKyr/SMaL
△ Less
Submitted 15 February, 2023; v1 submitted 2 October, 2022;
originally announced October 2022.
-
Transfer Learning with Pretrained Remote Sensing Transformers
Authors:
Anthony Fuller,
Koreen Millard,
James R. Green
Abstract:
Although the remote sensing (RS) community has begun to pretrain transformers (intended to be fine-tuned on RS tasks), it is unclear how these models perform under distribution shifts. Here, we pretrain a new RS transformer--called SatViT-V2--on 1.3 million satellite-derived RS images, then fine-tune it (along with five other models) to investigate how it performs on distributions not seen during…
▽ More
Although the remote sensing (RS) community has begun to pretrain transformers (intended to be fine-tuned on RS tasks), it is unclear how these models perform under distribution shifts. Here, we pretrain a new RS transformer--called SatViT-V2--on 1.3 million satellite-derived RS images, then fine-tune it (along with five other models) to investigate how it performs on distributions not seen during training. We split an expertly labeled land cover dataset into 14 datasets based on source biome. We train each model on each biome separately and test them on all other biomes. In all, this amounts to 1638 biome transfer experiments. After fine-tuning, we find that SatViT-V2 outperforms SatViT-V1 by 3.1% on in-distribution (matching biomes) and 2.8% on out-of-distribution (mismatching biomes) data. Additionally, we find that initializing fine-tuning from the linear probed solution (i.e., leveraging LPFT [1]) improves SatViT-V2's performance by another 1.2% on in-distribution and 2.4% on out-of-distribution data. Next, we find that pretrained RS transformers are better calibrated under distribution shifts than non-pretrained models and leveraging LPFT results in further improvements in model calibration. Lastly, we find that five measures of distribution shift are moderately correlated with biome transfer performance. We share code and pretrained model weights. (https://github.com/antofuller/SatViT)
△ Less
Submitted 28 September, 2022;
originally announced September 2022.
-
Longitudinal Acoustic Speech Tracking Following Pediatric Traumatic Brain Injury
Authors:
Camille Noufi,
Adam C. Lammert,
Daryush D. Mehta,
James R. Williamson,
Gregory Ciccarelli,
Douglas Sturim,
Jordan R. Green,
Thomas F. Quatieri,
Thomas F. Campbell
Abstract:
Recommendations for common outcome measures following pediatric traumatic brain injury (TBI) support the integration of instrumental measurements alongside perceptual assessment in recovery and treatment plans. A comprehensive set of sensitive, robust and non-invasive measurements is therefore essential in assessing variations in speech characteristics over time following pediatric TBI. In this ar…
▽ More
Recommendations for common outcome measures following pediatric traumatic brain injury (TBI) support the integration of instrumental measurements alongside perceptual assessment in recovery and treatment plans. A comprehensive set of sensitive, robust and non-invasive measurements is therefore essential in assessing variations in speech characteristics over time following pediatric TBI. In this article, we study the changes in the acoustic speech patterns of a pediatric cohort of ten subjects diagnosed with severe TBI. We extract a diverse set of both well-known and novel acoustic features from child speech recorded throughout the year after the child produced intelligible words. These features are analyzed individually and by speech subsystem, within-subject and across the cohort. As a group, older children exhibit highly significant (p<0.01) increases in pitch variation and phoneme diversity, shortened pause length, and steadying articulation rate variability. Younger children exhibit similar steadied rate variability alongside an increase in formant-based articulation complexity. Correlation analysis of the feature set with age and comparisons to normative developmental data confirm that age at injury plays a significant role in framing the recovery trajectory. Nearly all speech features significantly change (p<0.05) for the cohort as a whole, confirming that acoustic measures supplementing perceptual assessment are needed to identify efficacious treatment targets for speech therapy following TBI.
△ Less
Submitted 9 September, 2022;
originally announced September 2022.
-
Audiogram Digitization Tool for Audiological Reports
Authors:
François Charih,
James R. Green
Abstract:
A number of private and public insurers compensate workers whose hearing loss can be directly attributed to excessive exposure to noise in the workplace. The claim assessment process is typically lengthy and requires significant effort from human adjudicators who must interpret hand-recorded audiograms, often sent via fax or equivalent. In this work, we present a solution developed in partnership…
▽ More
A number of private and public insurers compensate workers whose hearing loss can be directly attributed to excessive exposure to noise in the workplace. The claim assessment process is typically lengthy and requires significant effort from human adjudicators who must interpret hand-recorded audiograms, often sent via fax or equivalent. In this work, we present a solution developed in partnership with the Workplace Safety Insurance Board of Ontario to streamline the adjudication process. In particular, we present the first audiogram digitization algorithm capable of automatically extracting the hearing thresholds from a scanned or faxed audiology report as a proof-of-concept. The algorithm extracts most thresholds within 5 dB accuracy, allowing to substantially lessen the time required to convert an audiogram into digital format in a semi-supervised fashion, and is a first step towards the automation of the adjudication process. The source code for the digitization algorithm and a desktop-based implementation of our NIHL annotation portal is publicly available on GitHub (https://github.com/GreenCUBIC/AudiogramDigitization).
△ Less
Submitted 13 September, 2022; v1 submitted 30 August, 2022;
originally announced August 2022.
-
Comparing Supervised Models And Learned Speech Representations For Classifying Intelligibility Of Disordered Speech On Selected Phrases
Authors:
Subhashini Venugopalan,
Joel Shor,
Manoj Plakal,
Jimmy Tobin,
Katrin Tomanek,
Jordan R. Green,
Michael P. Brenner
Abstract:
Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of diso…
▽ More
Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which were rated by speech-language pathologists for their overall intelligibility using a five-point Likert scale. We then evaluated classifiers developed using 3 approaches: (1) a convolutional neural network (CNN) trained for the task, (2) classifiers trained on non-semantic speech representations from CNNs that used an unsupervised objective [1], and (3) classifiers trained on the acoustic (encoder) embeddings from an ASR system trained on typical speech [2]. We found that the ASR encoder's embeddings considerably outperform the other two on detecting and classifying disordered speech. Further analysis shows that the ASR embeddings cluster speech by the spoken phrase, while the non-semantic embeddings cluster speech by speaker. Also, longer phrases are more indicative of intelligibility deficits than single words.
△ Less
Submitted 8 July, 2021;
originally announced July 2021.
-
Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale
Authors:
Michael Neumann,
Oliver Roesler,
Jackson Liscombe,
Hardik Kothare,
David Suendermann-Oeft,
David Pautler,
Indu Navar,
Aria Anvar,
Jochen Kumm,
Raquel Norel,
Ernest Fraenkel,
Alexander V. Sherman,
James D. Berry,
Gary L. Pattee,
Jun Wang,
Jordan R. Green,
Vikram Ramanarayanan
Abstract:
We propose a cloud-based multimodal dialog platform for the remote assessment and monitoring of Amyotrophic Lateral Sclerosis (ALS) at scale. This paper presents our vision, technology setup, and an initial investigation of the efficacy of the various acoustic and visual speech metrics automatically extracted by the platform. 82 healthy controls and 54 people with ALS (pALS) were instructed to int…
▽ More
We propose a cloud-based multimodal dialog platform for the remote assessment and monitoring of Amyotrophic Lateral Sclerosis (ALS) at scale. This paper presents our vision, technology setup, and an initial investigation of the efficacy of the various acoustic and visual speech metrics automatically extracted by the platform. 82 healthy controls and 54 people with ALS (pALS) were instructed to interact with the platform and completed a battery of speaking tasks designed to probe the acoustic, articulatory, phonatory, and respiratory aspects of their speech. We find that multiple acoustic (rate, duration, voicing) and visual (higher order statistics of the jaw and lip) speech metrics show statistically significant differences between controls, bulbar symptomatic and bulbar pre-symptomatic patients. We report on the sensitivity and specificity of these metrics using five-fold cross-validation. We further conducted a LASSO-LARS regression analysis to uncover the relative contributions of various acoustic and visual features in predicting the severity of patients' ALS (as measured by their self-reported ALSFRS-R scores). Our results provide encouraging evidence of the utility of automatically extracted audiovisual analytics for scalable remote patient assessment and monitoring in ALS.
△ Less
Submitted 15 April, 2021;
originally announced April 2021.
-
A Sparse Non-negative Matrix Factorization Framework for Identifying Functional Units of Tongue Behavior from MRI
Authors:
Jonghye Woo,
Jerry L. Prince,
Maureen Stone,
Fangxu Xing,
Arnold Gomez,
Jordan R. Green,
Christopher J. Hartnick,
Thomas J. Brady,
Timothy G. Reese,
Van J. Wedeen,
Georges El Fakhri
Abstract:
Muscle coordination patterns of lingual behaviors are synergies generated by deforming local muscle groups in a variety of ways. Functional units are functional muscle groups of local structural elements within the tongue that compress, expand, and move in a cohesive and consistent manner. Identifying the functional units using tagged-Magnetic Resonance Imaging (MRI) sheds light on the mechanisms…
▽ More
Muscle coordination patterns of lingual behaviors are synergies generated by deforming local muscle groups in a variety of ways. Functional units are functional muscle groups of local structural elements within the tongue that compress, expand, and move in a cohesive and consistent manner. Identifying the functional units using tagged-Magnetic Resonance Imaging (MRI) sheds light on the mechanisms of normal and pathological muscle coordination patterns, yielding improvement in surgical planning, treatment, or rehabilitation procedures. Here, to mine this information, we propose a matrix factorization and probabilistic graphical model framework to produce building blocks and their associated weighting map using motion quantities extracted from tagged-MRI. Our tagged-MRI imaging and accurate voxel-level tracking provide previously unavailable internal tongue motion patterns, thus revealing the inner workings of the tongue during speech or other lingual behaviors. We then employ spectral clustering on the weighting map to identify the cohesive regions defined by the tongue motion that may involve multiple or undocumented regions. To evaluate our method, we perform a series of experiments. We first use two-dimensional images and synthetic data to demonstrate the accuracy of our method. We then use three-dimensional synthetic and \textit{in vivo} tongue motion data using protrusion and simple speech tasks to identify subject-specific and data-driven functional units of the tongue in localized regions.
△ Less
Submitted 29 September, 2018; v1 submitted 15 April, 2018;
originally announced April 2018.