-
U-R-VEDA: Integrating UNET, Residual Links, Edge and Dual Attention, and Vision Transformer for Accurate Semantic Segmentation of CMRs
Authors:
Racheal Mukisa,
Arvind K. Bansal
Abstract:
Artificial intelligence, including deep learning models, will play a transformative role in automated medical image analysis for the diagnosis of cardiac disorders and their management. Automated accurate delineation of cardiac images is the first necessary initial step for the quantification and automated diagnosis of cardiac disorders. In this paper, we propose a deep learning based enhanced UNe…
▽ More
Artificial intelligence, including deep learning models, will play a transformative role in automated medical image analysis for the diagnosis of cardiac disorders and their management. Automated accurate delineation of cardiac images is the first necessary initial step for the quantification and automated diagnosis of cardiac disorders. In this paper, we propose a deep learning based enhanced UNet model, U-R-Veda, which integrates convolution transformations, vision transformer, residual links, channel-attention, and spatial attention, together with edge-detection based skip-connections for an accurate fully-automated semantic segmentation of cardiac magnetic resonance (CMR) images. The model extracts local-features and their interrelationships using a stack of combination convolution blocks, with embedded channel and spatial attention in the convolution block, and vision transformers. Deep embedding of channel and spatial attention in the convolution block identifies important features and their spatial localization. The combined edge information with channel and spatial attention as skip connection reduces information-loss during convolution transformations. The overall model significantly improves the semantic segmentation of CMR images necessary for improved medical image analysis. An algorithm for the dual attention module (channel and spatial attention) has been presented. Performance results show that U-R-Veda achieves an average accuracy of 95.2%, based on DSC metrics. The model outperforms the accuracy attained by other models, based on DSC and HD metrics, especially for the delineation of right-ventricle and left-ventricle-myocardium.
△ Less
Submitted 25 June, 2025;
originally announced June 2025.
-
Time-Modulated EM Skins for Integrated Sensing and Communications
Authors:
Lorenzo Poli,
Aakash Bansal,
Giacomo Oliveri,
Aaron Angel Salas-Sanchez,
Will Whittow,
Andrea Massa
Abstract:
An innovative solution, based on the exploitation of the harmonic beams generated by time-modulated electromagnetic skins (TM-EMSs), is proposed for the implementation of integrated sensing and communication (ISAC) functionalities in a Smart Electromagnetic Environment (SEME) scenario. More in detail, the field radiated by a user terminal, located at an unknown position, is assumed to illuminate a…
▽ More
An innovative solution, based on the exploitation of the harmonic beams generated by time-modulated electromagnetic skins (TM-EMSs), is proposed for the implementation of integrated sensing and communication (ISAC) functionalities in a Smart Electromagnetic Environment (SEME) scenario. More in detail, the field radiated by a user terminal, located at an unknown position, is assumed to illuminate a passive TM-EMS that, thanks to a suitable modulation of the local reflection coefficients at the meta-atom level of the EMS surface, simultaneously reflects towards a receiving base station (BS) a "sum" beam and a "difference" one at slightly different frequencies. By processing the received signals and exploiting monopulse radar tracking concepts, the BS both localizes the user terminal and, as a by-product, establishes a communication link with it by leveraging on the "sum" reflected beam. Towards this purpose, the arising harmonic beam control problem is reformulated as a global optimization one, which is successively solved by means of an evolutionary iterative approach to determine the desired TM-EMS modulation sequence. The results from selected numerical and experimental tests are reported to assess the effectiveness and the reliability of the proposed approach.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
DADU: Dual Attention-based Deep Supervised UNet for Automated Semantic Segmentation of Cardiac Images
Authors:
Racheal Mukisa,
Arvind K. Bansal
Abstract:
We propose an enhanced deep learning-based model for image segmentation of the left and right ventricles and myocardium scar tissue from cardiac magnetic resonance (CMR) images. The proposed technique integrates UNet, channel and spatial attention, edge-detection based skip-connection and deep supervised learning to improve the accuracy of the CMR image-segmentation. Images are processed using mul…
▽ More
We propose an enhanced deep learning-based model for image segmentation of the left and right ventricles and myocardium scar tissue from cardiac magnetic resonance (CMR) images. The proposed technique integrates UNet, channel and spatial attention, edge-detection based skip-connection and deep supervised learning to improve the accuracy of the CMR image-segmentation. Images are processed using multiple channels to generate multiple feature-maps. We built a dual attention-based model to integrate channel and spatial attention. The use of extracted edges in skip connection improves the reconstructed images from feature-maps. The use of deep supervision reduces vanishing gradient problems inherent in classification based on deep neural networks. The algorithms for dual attention-based model, corresponding implementation and performance results are described. The performance results show that this approach has attained high accuracy: 98% Dice Similarity Score (DSC) and significantly lower Hausdorff Distance (HD). The performance results outperform other leading techniques both in DSC and HD.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
Cardiac MRI Semantic Segmentation for Ventricles and Myocardium using Deep Learning
Authors:
Racheal Mukisa,
Arvind K. Bansal
Abstract:
Automated noninvasive cardiac diagnosis plays a critical role in the early detection of cardiac disorders and cost-effective clinical management. Automated diagnosis involves the automated segmentation and analysis of cardiac images. Precise delineation of cardiac substructures and extraction of their morphological attributes are essential for evaluating the cardiac function, and diagnosing cardio…
▽ More
Automated noninvasive cardiac diagnosis plays a critical role in the early detection of cardiac disorders and cost-effective clinical management. Automated diagnosis involves the automated segmentation and analysis of cardiac images. Precise delineation of cardiac substructures and extraction of their morphological attributes are essential for evaluating the cardiac function, and diagnosing cardiovascular disease such as cardiomyopathy, valvular diseases, abnormalities related to septum perforations, and blood-flow rate. Semantic segmentation labels the CMR image at the pixel level, and localizes its subcomponents to facilitate the detection of abnormalities, including abnormalities in cardiac wall motion in an aging heart with muscle abnormalities, vascular abnormalities, and valvular abnormalities. In this paper, we describe a model to improve semantic segmentation of CMR images. The model extracts edge-attributes and context information during down-sampling of the U-Net and infuses this information during up-sampling to localize three major cardiac structures: left ventricle cavity (LV); right ventricle cavity (RV); and LV myocardium (LMyo). We present an algorithm and performance results. A comparison of our model with previous leading models, using similarity metrics between actual image and segmented image, shows that our approach improves Dice similarity coefficient (DSC) by 2%-11% and lowers Hausdorff distance (HD) by 1.6 to 5.7 mm.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
Analysis of the MICCAI Brain Tumor Segmentation -- Metastases (BraTS-METS) 2025 Lighthouse Challenge: Brain Metastasis Segmentation on Pre- and Post-treatment MRI
Authors:
Nazanin Maleki,
Raisa Amiruddin,
Ahmed W. Moawad,
Nikolay Yordanov,
Athanasios Gkampenis,
Pascal Fehringer,
Fabian Umeh,
Crystal Chukwurah,
Fatima Memon,
Bojan Petrovic,
Justin Cramer,
Mark Krycia,
Elizabeth B. Shrickel,
Ichiro Ikuta,
Gerard Thompson,
Lorenna Vidal,
Vilma Kosovic,
Adam E. Goldman-Yassen,
Virginia Hill,
Tiffany So,
Sedra Mhana,
Albara Alotaibi,
Nathan Page,
Prisha Bhatia,
Melisa S. Guelen
, et al. (219 additional authors not shown)
Abstract:
Despite continuous advancements in cancer treatment, brain metastatic disease remains a significant complication of primary cancer and is associated with an unfavorable prognosis. One approach for improving diagnosis, management, and outcomes is to implement algorithms based on artificial intelligence for the automated segmentation of both pre- and post-treatment MRI brain images. Such algorithms…
▽ More
Despite continuous advancements in cancer treatment, brain metastatic disease remains a significant complication of primary cancer and is associated with an unfavorable prognosis. One approach for improving diagnosis, management, and outcomes is to implement algorithms based on artificial intelligence for the automated segmentation of both pre- and post-treatment MRI brain images. Such algorithms rely on volumetric criteria for lesion identification and treatment response assessment, which are still not available in clinical practice. Therefore, it is critical to establish tools for rapid volumetric segmentations methods that can be translated to clinical practice and that are trained on high quality annotated data. The BraTS-METS 2025 Lighthouse Challenge aims to address this critical need by establishing inter-rater and intra-rater variability in dataset annotation by generating high quality annotated datasets from four individual instances of segmentation by neuroradiologists while being recorded on video (two instances doing "from scratch" and two instances after AI pre-segmentation). This high-quality annotated dataset will be used for testing phase in 2025 Lighthouse challenge and will be publicly released at the completion of the challenge. The 2025 Lighthouse challenge will also release the 2023 and 2024 segmented datasets that were annotated using an established pipeline of pre-segmentation, student annotation, two neuroradiologists checking, and one neuroradiologist finalizing the process. It builds upon its previous edition by including post-treatment cases in the dataset. Using these high-quality annotated datasets, the 2025 Lighthouse challenge plans to test benchmark algorithms for automated segmentation of pre-and post-treatment brain metastases (BM), trained on diverse and multi-institutional datasets of MRI images obtained from patients with brain metastases.
△ Less
Submitted 10 July, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
-
An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems
Authors:
Hitesh Tulsiani,
David M. Chan,
Shalini Ghosh,
Garima Lalwani,
Prabhat Pandey,
Ankish Bansal,
Sri Garimella,
Ariya Rastrow,
Björn Hoffmeister
Abstract:
Dialog systems, such as voice assistants, are expected to engage with users in complex, evolving conversations. Unfortunately, traditional automatic speech recognition (ASR) systems deployed in such applications are usually trained to recognize each turn independently and lack the ability to adapt to the conversational context or incorporate user feedback. In this work, we introduce a general fram…
▽ More
Dialog systems, such as voice assistants, are expected to engage with users in complex, evolving conversations. Unfortunately, traditional automatic speech recognition (ASR) systems deployed in such applications are usually trained to recognize each turn independently and lack the ability to adapt to the conversational context or incorporate user feedback. In this work, we introduce a general framework for ASR in dialog systems that can go beyond learning from single-turn utterances and learn over time how to adapt to both explicit supervision and implicit user feedback present in multi-turn conversations. We accomplish that by leveraging advances in student-teacher learning and context-aware dialog processing, and designing contrastive self-supervision approaches with Ohm, a new online hard-negative mining approach. We show that leveraging our new framework compared to traditional training leads to relative WER reductions of close to 10% in real-world dialog systems, and up to 26% on public synthetic data.
△ Less
Submitted 16 September, 2024;
originally announced September 2024.
-
Joint Transmit and Jamming Power Optimization for Secrecy in Energy Harvesting Networks: A Reinforcement Learning Approach
Authors:
Shalini Tripathi,
Chinmoy Kundu,
Animesh Yadav,
Ankur Bansal,
Holger Claussen,
Lester Ho
Abstract:
In this paper, we address the problem of joint allocation of transmit and jamming power at the source and destination, respectively, to enhance the long-term cumulative secrecy performance of an energy-harvesting wireless communication system until it stops functioning in the presence of an eavesdropper. The source and destination have energy-harvesting devices with limited battery capacities. The…
▽ More
In this paper, we address the problem of joint allocation of transmit and jamming power at the source and destination, respectively, to enhance the long-term cumulative secrecy performance of an energy-harvesting wireless communication system until it stops functioning in the presence of an eavesdropper. The source and destination have energy-harvesting devices with limited battery capacities. The destination also has a full-duplex transceiver to transmit jamming signals for secrecy. We frame the problem as an infinite-horizon Markov decision process (MDP) problem and propose a reinforcement learning-based optimal joint power allocation (OJPA) algorithm that employs a policy iteration (PI) algorithm. Since the optimal algorithm is computationally expensive, we develop a low-complexity sub-optimal joint power allocation (SJPA) algorithm, namely, reduced state joint power allocation (RSJPA). Two other SJPA algorithms, the greedy algorithm (GA) and the naive algorithm (NA), are implemented as benchmarks. In addition, the OJPA algorithm outperforms the individual power allocation (IPA) algorithms termed individual transmit power allocation (ITPA) and individual jamming power allocation (IJPA), where the transmit and jamming powers, respectively, are optimized individually. The results show that the OJPA algorithm is also more energy efficient. Simulation results show that the OJPA algorithm significantly improves the secrecy performance compared to all SJPA algorithms. The proposed RSJPA algorithm achieves nearly optimal performance with significantly less computational complexity marking it the balanced choice between the complexity and the performance. We find that the computational time for the RSJPA algorithm is around 75 percent less than the OJPA algorithm.
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
Synergistic Perception and Control Simplex for Verifiable Safe Vertical Landing
Authors:
Ayoosh Bansal,
Yang Zhao,
James Zhu,
Sheng Cheng,
Yuliang Gu,
Hyung-Jin Yoon,
Hunmin Kim,
Naira Hovakimyan,
Lui Sha
Abstract:
Perception, Planning, and Control form the essential components of autonomy in advanced air mobility. This work advances the holistic integration of these components to enhance the performance and robustness of the complete cyber-physical system. We adapt Perception Simplex, a system for verifiable collision avoidance amidst obstacle detection faults, to the vertical landing maneuver for autonomou…
▽ More
Perception, Planning, and Control form the essential components of autonomy in advanced air mobility. This work advances the holistic integration of these components to enhance the performance and robustness of the complete cyber-physical system. We adapt Perception Simplex, a system for verifiable collision avoidance amidst obstacle detection faults, to the vertical landing maneuver for autonomous air mobility vehicles. We improve upon this system by replacing static assumptions of control capabilities with dynamic confirmation, i.e., real-time confirmation of control limitations of the system, ensuring reliable fulfillment of safety maneuvers and overrides, without dependence on overly pessimistic assumptions. Parameters defining control system capabilities and limitations, e.g., maximum deceleration, are continuously tracked within the system and used to make safety-critical decisions. We apply these techniques to propose a verifiable collision avoidance solution for autonomous aerial mobility vehicles operating in cluttered and potentially unsafe environments.
△ Less
Submitted 5 December, 2023;
originally announced December 2023.
-
The Brain Tumor Segmentation (BraTS-METS) Challenge 2023: Brain Metastasis Segmentation on Pre-treatment MRI
Authors:
Ahmed W. Moawad,
Anastasia Janas,
Ujjwal Baid,
Divya Ramakrishnan,
Rachit Saluja,
Nader Ashraf,
Nazanin Maleki,
Leon Jekel,
Nikolay Yordanov,
Pascal Fehringer,
Athanasios Gkampenis,
Raisa Amiruddin,
Amirreza Manteghinejad,
Maruf Adewole,
Jake Albrecht,
Udunna Anazodo,
Sanjay Aneja,
Syed Muhammad Anwar,
Timothy Bergquist,
Veronica Chiang,
Verena Chung,
Gian Marco Conte,
Farouk Dako,
James Eddy,
Ivan Ezhov
, et al. (207 additional authors not shown)
Abstract:
The translation of AI-generated brain metastases (BM) segmentation into clinical practice relies heavily on diverse, high-quality annotated medical imaging datasets. The BraTS-METS 2023 challenge has gained momentum for testing and benchmarking algorithms using rigorously annotated internationally compiled real-world datasets. This study presents the results of the segmentation challenge and chara…
▽ More
The translation of AI-generated brain metastases (BM) segmentation into clinical practice relies heavily on diverse, high-quality annotated medical imaging datasets. The BraTS-METS 2023 challenge has gained momentum for testing and benchmarking algorithms using rigorously annotated internationally compiled real-world datasets. This study presents the results of the segmentation challenge and characterizes the challenging cases that impacted the performance of the winning algorithms. Untreated brain metastases on standard anatomic MRI sequences (T1, T2, FLAIR, T1PG) from eight contributed international datasets were annotated in stepwise method: published UNET algorithms, student, neuroradiologist, final approver neuroradiologist. Segmentations were ranked based on lesion-wise Dice and Hausdorff distance (HD95) scores. False positives (FP) and false negatives (FN) were rigorously penalized, receiving a score of 0 for Dice and a fixed penalty of 374 for HD95. Eight datasets comprising 1303 studies were annotated, with 402 studies (3076 lesions) released on Synapse as publicly available datasets to challenge competitors. Additionally, 31 studies (139 lesions) were held out for validation, and 59 studies (218 lesions) were used for testing. Segmentation accuracy was measured as rank across subjects, with the winning team achieving a LesionWise mean score of 7.9. Common errors among the leading teams included false negatives for small lesions and misregistration of masks in space.The BraTS-METS 2023 challenge successfully curated well-annotated, diverse datasets and identified common errors, facilitating the translation of BM segmentation across varied clinical environments and providing personalized volumetric reports to patients undergoing BM treatment.
△ Less
Submitted 8 December, 2024; v1 submitted 1 June, 2023;
originally announced June 2023.
-
Diversity Analysis of Multi-Aperture UWOC System over EGG Channel with Pointing Errors
Authors:
Ziyaur Rahman,
Ankur Bansal,
S. M. Zafaruddin
Abstract:
Single aperture reception for underwater wireless optical communication (UWOC) is insufficient to deal with oceanic turbulence caused by the combined effect of temperature gradient and air bubbles. This paper analyzes the performance of multi-aperture reception for UWOC under channel irradiance fluctuations characterized by the mixture exponential generalized gamma (EGG) distribution. We analyze t…
▽ More
Single aperture reception for underwater wireless optical communication (UWOC) is insufficient to deal with oceanic turbulence caused by the combined effect of temperature gradient and air bubbles. This paper analyzes the performance of multi-aperture reception for UWOC under channel irradiance fluctuations characterized by the mixture exponential generalized gamma (EGG) distribution. We analyze the system performance by employing both selection combining (SC) and maximum ratio combining (MRC) receivers. In particular, we derive the exact outage probability expression for the SC-based multi-aperture UWOC receiver and obtain an upper bound on the outage probability for the MRC-based multi-aperture UWOC receiver. With the help of the derived results, we analytically obtain the diversity order of the considered multi-aperture UWOC system.
△ Less
Submitted 19 December, 2022;
originally announced December 2022.
-
Perception Simplex: Verifiable Collision Avoidance in Autonomous Vehicles Amidst Obstacle Detection Faults
Authors:
Ayoosh Bansal,
Hunmin Kim,
Simon Yu,
Bo Li,
Naira Hovakimyan,
Marco Caccamo,
Lui Sha
Abstract:
Advances in deep learning have revolutionized cyber-physical applications, including the development of Autonomous Vehicles. However, real-world collisions involving autonomous control of vehicles have raised significant safety concerns regarding the use of Deep Neural Networks (DNN) in safety-critical tasks, particularly Perception. The inherent unverifiability of DNNs poses a key challenge in en…
▽ More
Advances in deep learning have revolutionized cyber-physical applications, including the development of Autonomous Vehicles. However, real-world collisions involving autonomous control of vehicles have raised significant safety concerns regarding the use of Deep Neural Networks (DNN) in safety-critical tasks, particularly Perception. The inherent unverifiability of DNNs poses a key challenge in ensuring their safe and reliable operation.
In this work, we propose Perception Simplex (PS), a fault-tolerant application architecture designed for obstacle detection and collision avoidance. We analyze an existing LiDAR-based classical obstacle detection algorithm to establish strict bounds on its capabilities and limitations. Such analysis and verification have not been possible for deep learning-based perception systems yet. By employing verifiable obstacle detection algorithms, PS identifies obstacle existence detection faults in the output of unverifiable DNN-based object detectors. When faults with potential collision risks are detected, appropriate corrective actions are initiated. Through extensive analysis and software-in-the-loop simulations, we demonstrate that PS provides predictable and deterministic fault tolerance against obstacle existence detection faults, establishing a robust safety guarantee.
△ Less
Submitted 28 November, 2023; v1 submitted 4 September, 2022;
originally announced September 2022.
-
Verifiable Obstacle Detection
Authors:
Ayoosh Bansal,
Hunmin Kim,
Simon Yu,
Bo Li,
Naira Hovakimyan,
Marco Caccamo,
Lui Sha
Abstract:
Perception of obstacles remains a critical safety concern for autonomous vehicles. Real-world collisions have shown that the autonomy faults leading to fatal collisions originate from obstacle existence detection. Open source autonomous driving implementations show a perception pipeline with complex interdependent Deep Neural Networks. These networks are not fully verifiable, making them unsuitabl…
▽ More
Perception of obstacles remains a critical safety concern for autonomous vehicles. Real-world collisions have shown that the autonomy faults leading to fatal collisions originate from obstacle existence detection. Open source autonomous driving implementations show a perception pipeline with complex interdependent Deep Neural Networks. These networks are not fully verifiable, making them unsuitable for safety-critical tasks.
In this work, we present a safety verification of an existing LiDAR based classical obstacle detection algorithm. We establish strict bounds on the capabilities of this obstacle detection algorithm. Given safety standards, such bounds allow for determining LiDAR sensor properties that would reliably satisfy the standards. Such analysis has as yet been unattainable for neural network based perception systems. We provide a rigorous analysis of the obstacle detection system with empirical results based on real-world sensor data.
△ Less
Submitted 30 August, 2022;
originally announced August 2022.
-
Multiple Antenna Selection and Successive Signal Detection for SM-based IRS-aided Communication
Authors:
Hasan Albinsaid,
Keshav Singh,
Ankur Bansal,
Sudip Biswas,
Chih-Peng Li,
Zygmunt J. Haas
Abstract:
Intelligent reflecting surface (IRS) is being considered as a prospective candidate for next-generation wireless communication due to its ability to significantly improve coverage and spectral efficiency by controlling the propagation environment. One of the ways IRS increases spectral efficiency is by adjusting phase shifts to perform passive beamforming. In this letter, we integrate the concept…
▽ More
Intelligent reflecting surface (IRS) is being considered as a prospective candidate for next-generation wireless communication due to its ability to significantly improve coverage and spectral efficiency by controlling the propagation environment. One of the ways IRS increases spectral efficiency is by adjusting phase shifts to perform passive beamforming. In this letter, we integrate the concept of IRS-aided communication to the domain of multi-direction beamforming, whereby multiple receive antennas are selected to convey more information bits than existing spatial modulation (SM) techniques at any specific time. To complement this system, we also propose a successive signal detection (SSD) technique at the receiver. Numerical results show that the proposed design is able to improve the average successful bits transmitted (ASBT) by the system, which outperforms other state-of-the-art methods proposed in the literature.
△ Less
Submitted 28 April, 2021;
originally announced April 2021.
-
Recurrent Neural Network Assisted Transmitter Selection for Secrecy in Cognitive Radio Network
Authors:
Shalini Tripathi,
Chinmoy Kundu,
Octavia A. Dobre,
Ankur Bansal,
Mark F. Flanagan
Abstract:
In this paper, we apply the long short-term memory (LSTM), an advanced recurrent neural network based machine learning (ML) technique, to the problem of transmitter selection (TS) for secrecy in an underlay small-cell cognitive radio network with unreliable backhaul connections. The cognitive communication scenario under consideration has a secondary small-cell network that shares the same spectru…
▽ More
In this paper, we apply the long short-term memory (LSTM), an advanced recurrent neural network based machine learning (ML) technique, to the problem of transmitter selection (TS) for secrecy in an underlay small-cell cognitive radio network with unreliable backhaul connections. The cognitive communication scenario under consideration has a secondary small-cell network that shares the same spectrum of the primary network with an agreement to always maintain a desired outage probability constraint in the primary network. Due to the interference from the secondary transmitter common to all primary transmissions, the secrecy rates for the different transmitters are correlated. LSTM exploits this correlation and matches the performance of the conventional technique when the number of transmitters is small. As the number grows, the performance degrades in the same manner as other ML techniques such as support vector machine, $k$-nearest neighbors, naive Bayes, and deep neural network. However, LSTM still significantly outperforms these techniques in misclassification ratio and secrecy outage probability. It also reduces the feedback overhead against conventional TS.
△ Less
Submitted 16 February, 2021;
originally announced February 2021.
-
Streaming End-to-End Bilingual ASR Systems with Joint Language Identification
Authors:
Surabhi Punjabi,
Harish Arsikere,
Zeynab Raeesy,
Chander Chandak,
Nikhil Bhave,
Ankish Bansal,
Markus Müller,
Sergio Murillo,
Ariya Rastrow,
Sri Garimella,
Roland Maas,
Mat Hans,
Athanasios Mouchtaris,
Siegfried Kunzmann
Abstract:
Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream pr…
▽ More
Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.
△ Less
Submitted 8 July, 2020;
originally announced July 2020.
-
Unsupervised Audiovisual Synthesis via Exemplar Autoencoders
Authors:
Kangle Deng,
Aayush Bansal,
Deva Ramanan
Abstract:
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar spee…
▽ More
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar speech. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers and styles using only 3 minutes of target audio-video data, without requiring {\em any} training data for the input speaker. To do so, we learn audiovisual bottleneck representations that capture the structured linguistic content of speech. We outperform prior approaches on both audio and video synthesis, and provide extensive qualitative analysis on our project page -- https://www.cs.cmu.edu/~exemplar-ae/.
△ Less
Submitted 3 July, 2021; v1 submitted 13 January, 2020;
originally announced January 2020.