
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks

Authors: Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, David Ifeoluwa Adelani

Abstract: Large language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation tasks on both speech and text modalities. We evaluated the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLM coverage.

Submitted 24 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

Comments: working paper

arXiv:2504.20944 [pdf, other]

Deep Learning Characterizes Depression and Suicidal Ideation from Eye Movements

Authors: Kleanthis Avramidis, Woojae Jeong, Aditya Kommineni, Sudarsana R. Kadiri, Marcus Ma, Colin McDaniel, Myzelle Hughes, Thomas McGee, Elsi Kaiser, Dani Byrd, Assal Habibi, B. Rael Cahn, Idan A. Blank, Kristina Lerman, Takfarinas Medani, Richard M. Leahy, Shrikanth Narayanan

Abstract: Identifying physiological and behavioral markers for mental health conditions is a longstanding challenge in psychiatry. Depression and suicidal ideation, in particular, lack objective biomarkers, with screening and diagnosis primarily relying on self-reports and clinical interviews. Here, we investigate eye tracking as a potential marker modality for screening purposes. Eye movements are directly modulated by neuronal networks and have been associated with attentional and mood-related patterns; however, their predictive value for depression and suicidality remains unclear. We recorded eye-tracking sequences from 126 young adults as they read and responded to affective sentences, and subsequently developed a deep learning framework to predict their clinical status. The proposed model included separate branches for trials of positive and negative sentiment, and used 2D time-series representations to account for both intra-trial and inter-trial variations. We were able to identify depression and suicidal ideation with an area under the receiver operating curve (AUC) of 0.793 (95% CI: 0.765-0.819) against healthy controls, and suicidality specifically with 0.826 AUC (95% CI: 0.797-0.852). The model also exhibited moderate, yet significant, accuracy in differentiating depressed from suicidal participants, with 0.609 AUC (95% CI: 0.571-0.646). Discriminative patterns emerge more strongly when assessing the data relative to response generation than relative to the onset time of the final word of the sentences. The most pronounced effects were observed for negative-sentiment sentences, which are congruent with depressed and suicidal participants. Our findings highlight eye tracking as an objective tool for mental health assessment and underscore the modulatory impact of emotional stimuli on cognitive processes affecting oculomotor control.

Submitted 29 April, 2025; originally announced April 2025.

Comments: Preprint. 12 pages, 5 figures

arXiv:2504.10686 [pdf, other]

The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, et al. (122 additional authors not shown)

Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. A robust participation saw 244 registered entrants, with 43 teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.

Submitted 14 April, 2025; originally announced April 2025.

Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages
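
For readers unfamiliar with the PSNR thresholds quoted in the abstract above, the following is a minimal sketch of the standard peak signal-to-noise ratio formula, PSNR = 10 log10(peak^2 / MSE). The function name `psnr`, the flat-list pixel representation, and the 8-bit peak value of 255 are illustrative assumptions; the challenge's actual evaluation protocol (color space, border cropping, and so on) is not described in this listing.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    `peak` is the maximum possible pixel value (255 assumes 8-bit images,
    which is an assumption made here for illustration only).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)
```

Under this definition, a higher value means the super-resolved image is closer to the reference, so a fixed PSNR floor lets entries compete purely on runtime, parameters, and FLOPs.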

arXiv:2503.09489 [pdf, ps, other]

Optimal ISAC Beamforming Structure and Efficient Algorithms for Sum Rate and CRLB Balancing

Authors: Tianyu Fang, Mengyuan Ma, Markku Juntti, Nir Shlezinger, A. Lee Swindlehurst, Nhan Thanh Nguyen

Abstract: Integrated sensing and communications (ISAC) has emerged as a promising paradigm to unify wireless communications and radar sensing, enabling efficient spectrum and hardware utilization. A core challenge in realizing the gains of ISAC stems from dual-purpose beamforming design, owing to the highly non-convex nature of key performance metrics such as sum rate for communications and the Cramér-Rao lower bound (CRLB) for sensing. In this paper, we propose a low-complexity structured approach to ISAC beamforming optimization to simultaneously enhance spectral efficiency and estimation accuracy. Specifically, we develop a successive convex approximation (SCA) based algorithm which transforms the original non-convex problem into a sequence of convex subproblems ensuring convergence to a locally optimal solution. Furthermore, leveraging the proposed SCA framework and the Lagrange duality, we derive the optimal beamforming structure for CRLB optimization in ISAC systems. Our findings characterize the reduction in radar streams one can employ without affecting performance. This enables a dimensionality reduction that enhances computational efficiency. Numerical simulations validate that our approach achieves comparable or superior performance to the considered benchmarks while requiring much lower computational costs.

Submitted 12 March, 2025; originally announced March 2025.

Comments: journal version of our previous work, submitted for possible publication

arXiv:2502.21065 [pdf, other]

doi 10.1016/j.mechatronics.2025.103311

Lifted Frequency-Domain Identification of Closed-Loop Multirate Systems: Applied to Dual-Stage Actuator Hard Disk Drives

Authors: Max van Haren, Masahiro Mae, Lennart Blanken, Tom Oomen

Abstract: Frequency-domain representations are crucial for the design and performance evaluation of controllers in multirate systems, specifically to address intersample performance. The aim of this paper is to develop an effective frequency-domain system identification technique for closed-loop multirate systems using solely slow-rate output measurements. By indirect identification of multivariable time-invariant representations through lifting, in combination with local modeling techniques, the multirate system is effectively identified. The developed method is capable of accurate identification of closed-loop multirate systems within a single identification experiment, using fast-rate excitation and inputs, and slow-rate outputs. Finally, the developed framework is validated using a benchmark problem consisting of a multivariable dual-stage actuator from a hard disk drive, demonstrating its applicability and accuracy.

Submitted 9 April, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

Journal ref: Mechatronics, 108:103311 (2025)

arXiv:2502.07243 [pdf, other]

Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

Authors: Xueyao Zhang, Xiaohui Zhang, Kainan Peng, Zhenyu Tang, Vimal Manohar, Yingru Liu, Jeff Hwang, Dangna Li, Yuhao Wang, Julian Chan, Yuan Huang, Zhizheng Wu, Mingbo Ma

Abstract: The imitation of voice, targeted on specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data, and struggle with effectively disentangling timbre and style, leading to challenges in achieving controllable generation, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or speech's content tokens as input, we utilize an autoregressive transformer, prompted by a style reference, to generate the content-style tokens; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer, prompted by a timbre reference, to produce acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck, and adjust it carefully to obtain the disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at https://versavoice.github.io.

Submitted 10 February, 2025; originally announced February 2025.

Comments: Accepted by ICLR 2025

arXiv:2501.13130 [pdf, other]

A Novel Scene Coupling Semantic Mask Network for Remote Sensing Image Segmentation

Authors: Xiaowen Ma, Rongrong Lian, Zhenkai Wu, Renxiang Guan, Tingfeng Hong, Mengjiao Zhao, Mengting Ma, Jiangtao Nie, Zhenhong Du, Siyang Song, Wei Zhang

Abstract: As a common method in the field of computer vision, the spatial attention mechanism has been widely used in semantic segmentation of remote sensing images due to its outstanding long-range dependency modeling capability. However, remote sensing images are usually characterized by complex backgrounds and large intra-class variance that would degrade their analysis performance. Because vanilla spatial attention mechanisms are based on dense affine operations, they tend to introduce a large amount of background contextual information and lack consideration of intrinsic spatial correlation. To deal with such limitations, this paper proposes a novel scene coupling semantic mask network, which reconstructs the vanilla attention with scene coupling and local global semantic mask strategies. Specifically, the scene coupling module decomposes scene information into global representations and object distributions, which are then embedded in the attention affinity processes. This strategy effectively utilizes the intrinsic spatial correlation between features and thereby improves the process of attention modeling. Meanwhile, the local global semantic mask module indirectly correlates pixels with the global semantic masks by using the local semantic mask as an intermediate sensory element, which reduces the background contextual interference and mitigates the effect of intra-class variance. By combining the above two strategies, we propose the model SCSM, which not only can efficiently segment various geospatial objects in complex scenarios, but also possesses clean and elegant mathematical representations. Experimental results on four benchmark datasets demonstrate the effectiveness of the above two strategies for improving the attention modeling of remote sensing images. The dataset and code are available at https://github.com/xwmaxwma/rssegmentation

Submitted 21 January, 2025; originally announced January 2025.

Comments: Accepted by ISPRS Journal of Photogrammetry and Remote Sensing

arXiv:2412.13967 [pdf, ps, other]

THz Channels for Short-Range Mobile Networks: Multipath Clusters and Human Body Shadowing

Authors: Minseok Kim, Jun-ichi Takada, Minghe Mao, Che Chia Kang, Xin Du, Anirban Ghosh

Abstract: The THz band (0.1-10 THz) is emerging as a crucial enabler for sixth-generation (6G) mobile communication systems, overcoming the limitations of current technologies and unlocking new opportunities for low-latency and ultra-high-speed communications by utilizing several tens of GHz transmission bandwidths. However, extremely high spreading losses and other interaction losses pose significant challenges to establishing wide-area communication coverage, while human body shadowing further complicates maintaining stable communication links. Although point-to-point (P2P) fixed wireless access in the THz band has been successfully demonstrated, realizing fully mobile and reliable wireless access remains a challenge due to numerous issues to be solved for highly directional communication. To provide insights into the design of THz communication systems, this article addresses the challenges associated with THz short-range mobile access networks. It offers an overview of recent findings on the environment-dependence of multipath cluster channel properties and the impact of human body shadowing, based on measurements at 300 GHz using a double-directional high-resolution channel sounder and a motion capture-integrated channel sounder.

Submitted 18 December, 2024; originally announced December 2024.

arXiv:2412.13365 [pdf, other]

Quantitative Predictive Monitoring and Control for Safe Human-Machine Interaction

Authors: Shuyang Dong, Meiyi Ma, Josephine Lamp, Sebastian Elbaum, Matthew B. Dwyer, Lu Feng

Abstract: There is a growing trend toward AI systems interacting with humans to revolutionize a range of application domains such as healthcare and transportation. However, unsafe human-machine interaction can lead to catastrophic failures. We propose a novel approach that predicts future states by accounting for the uncertainty of human interaction, monitors whether predictions satisfy or violate safety requirements, and adapts control actions based on the predictive monitoring results. Specifically, we develop a new quantitative predictive monitor based on Signal Temporal Logic with Uncertainty (STL-U) to compute a robustness degree interval, which indicates the extent to which a sequence of uncertain predictions satisfies or violates an STL-U requirement. We also develop a new loss function to guide the uncertainty calibration of Bayesian deep learning and a new adaptive control method, both of which leverage STL-U quantitative predictive monitoring results. We apply the proposed approach to two case studies: Type 1 Diabetes management and semi-autonomous driving. Experiments show that the proposed approach improves safety and effectiveness in both case studies.

Submitted 17 December, 2024; originally announced December 2024.

arXiv:2410.17709 [pdf, other]

Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure

Authors: Chaoyun Zhang, Randolph Yao, Si Qin, Ze Li, Shekhar Agrawal, Binit R. Mishra, Tri Tran, Minghua Ma, Qingwei Lin, Murali Chintalapati, Dongmei Zhang

Abstract: The presence of unhealthy nodes in cloud infrastructure signals the potential failure of machines, which can significantly impact the availability and reliability of cloud services, resulting in negative customer experiences. Effectively addressing unhealthy node mitigation is therefore vital for sustaining cloud system performance. This paper introduces Deoxys, a causal inference engine tailored to recommending mitigation actions for unhealthy nodes in cloud systems to minimize virtual machine downtime and interruptions during unhealthy events. It employs double machine learning combined with causal forest to produce precise and reliable mitigation recommendations based solely on limited observational data collected from historical unhealthy events. To enhance the causal inference model, Deoxys further incorporates a policy fallback mechanism based on model uncertainty and action overriding mechanisms to (i) improve the reliability of the system, and (ii) strike a good tradeoff between downtime reduction and resource utilization, thereby enhancing the overall system performance. After deploying Deoxys in a large-scale cloud infrastructure at Microsoft, our observations demonstrate that Deoxys significantly reduces average VM downtime by 53% compared to a legacy policy, while leading to a 49.5% lower VM interruption rate. This substantial improvement enhances the reliability and stability of cloud platforms, resulting in a seamless customer experience.

Submitted 23 October, 2024; originally announced October 2024.

arXiv:2409.17644 [pdf, ps, other]

Model-Based Machine Learning for Max-Min Fairness Beamforming Design in JCAS Systems

Authors: Mengyuan Ma, Tianyu Fang, Nir Shlezinger, A. L. Swindlehurst, Markku Juntti, Nhan Nguyen

Abstract: Joint communications and sensing (JCAS) is expected to be a crucial technology for future wireless systems. This paper investigates beamforming design for a multi-user multi-target JCAS system to ensure fairness and balance between communications and sensing performance. We jointly optimize the transmit and receive beamformers to maximize the weighted sum of the minimum communications rate and sensing mutual information. The formulated problem is highly challenging due to its non-smooth and non-convex nature. To overcome the challenges, we reformulate the problem into an equivalent but more tractable form. We first solve this problem by alternating optimization (AO) and then propose a machine learning algorithm based on the AO approach. Numerical results show that our algorithm scales effectively with the number of communications users and provides better performance with shorter run time compared to conventional optimization approaches.

Submitted 26 November, 2024; v1 submitted 26 September, 2024; originally announced September 2024.

Comments: 5 pages, 5 figures

arXiv:2409.17638 [pdf, ps, other]

Digital and Hybrid Precoding Designs in Massive MIMO with Low-Resolution ADCs

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Italo Atzeni, A. Lee Swindlehurst, Markku Juntti

Abstract: Low-resolution analog-to-digital converters (ADCs) have emerged as an efficient solution for massive multiple-input multiple-output (MIMO) systems to reap high data rates with reasonable power consumption and hardware complexity. In this paper, we study precoding designs for digital, fully connected (FC) hybrid, and partially connected (PC) hybrid beamforming architectures in massive MIMO systems with low-resolution ADCs at the receiver. We aim to maximize the spectral efficiency (SE) subject to a transmit power budget and hardware constraints on the analog components. The resulting problems are nonconvex and the quantization distortion introduces additional challenges. To address them, we first derive a tight lower bound for the SE, based on which we optimize the precoders for the three beamforming architectures under the majorization-minorization framework. Numerical results validate the superiority of the proposed precoding designs over their state-of-the-art counterparts in systems with low-resolution ADCs, particularly those with 1-bit resolution. The results show that the PC hybrid precoding design can achieve an SE close to those of the digital and FC hybrid precoding designs in 1-bit systems, highlighting the potential of the PC hybrid beamforming architectures.

Submitted 11 February, 2025; v1 submitted 26 September, 2024; originally announced September 2024.

Comments: 5 pages, 7 figures

arXiv:2409.04447 [pdf, other]

Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples

Authors: Qi Fan, Yutong Li, Yi Xin, Xinyu Cheng, Guanglai Gao, Miao Ma

Abstract: The Multimodal Emotion Recognition challenge MER2024 focuses on recognizing emotions using audio, language, and visual signals. In this paper, we present our submission solutions for the Semi-Supervised Learning Sub-Challenge (MER2024-SEMI), which tackles the issue of limited annotated data in emotion recognition. Firstly, to address the class imbalance, we adopt an oversampling strategy. Secondly, we propose a modality representation combinatorial contrastive learning (MR-CCL) framework on the trimodal input data to establish robust initial models. Thirdly, we explore a self-training approach to expand the training set. Finally, we enhance prediction robustness through a multi-classifier weighted soft voting strategy. Our proposed method is validated to be effective on the MER2024-SEMI Challenge, achieving a weighted average F-score of 88.25% and ranking 6th on the leaderboard. Our project is available at https://github.com/WooyoohL/MER2024-SEMI.

Submitted 23 August, 2024; originally announced September 2024.

Comments: Accepted by ACM MM Workshop 2024

arXiv:2408.12239 [pdf, other]

Fast Burst-Sparsity Learning Approach for Massive MIMO-OTFS Channel Estimation

Authors: Ming Ma, Jisheng Dai, Xue-Qin Jiang

Abstract: Accurate channel estimation in orthogonal time frequency space (OTFS) systems with massive multiple-input multiple-output (MIMO) configurations is challenging due to high-dimensional sparse representation (SR). Existing methods often face performance degradation and/or high computational complexity. To address these issues and exploit intricate channel sparsity structure, this letter first leverages a novel hybrid burst-sparsity prior to capture the burst/common sparse structure in the angle/delay domain, and then utilizes an independent variational Bayesian inference (VBI) factorization technique to efficiently solve the high-dimensional SR problem. Additionally, an angle/Doppler refinement approach is incorporated into the proposed method to automatically mitigate off-grid mismatches.

Submitted 27 January, 2025; v1 submitted 22 August, 2024; originally announced August 2024.

Comments: 9 pages, 6 figures

arXiv:2408.11837 [pdf, other]

MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy

Authors: Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma

Abstract: Recent global estimates suggest that as many as 2.41 billion individuals have health conditions that would benefit from rehabilitation services. Home-based Physical Therapy (PT) faces significant challenges in providing interactive feedback and meaningful observation for therapists and patients. To fill this gap, we present MicroXercise, which integrates micro-motion analysis with wearable sensors, providing therapists and patients with a comprehensive feedback interface, including video, text, and scores. Crucially, it employs multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable methods to analyze the existing deep learning neural networks in monitoring exercises, focusing on a high granularity of exercise. This synergistic approach is pivotal, providing output matching the input size to precisely highlight critical subtleties and movements in PT, thus transforming complex AI analysis into clear, actionable feedback. By highlighting these micro-motions in different metrics, such as stability and range of motion, MicroXercise significantly enhances the understanding and relevance of feedback for end-users. Comparative performance metrics underscore its effectiveness over traditional methods, such as 39% and 42% improvements in Feature Mutual Information (FMI) and Continuity, respectively. MicroXercise is a step ahead in home-based physical therapy, providing a technologically advanced and intuitively helpful solution to enhance patient care and outcomes.

Submitted 6 August, 2024; originally announced August 2024.

Comments: Accepted by IEEE/ACM CHASE 2024
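
The multi-dimensional Dynamic Time Warping mentioned in the abstract above can be illustrated with a minimal sketch. This is the textbook "dependent" variant, in which each time step is a vector and the local cost is the Euclidean distance between vectors; the listing does not say which DTW variant, step pattern, or sensor channels MicroXercise actually uses, so the function name `dtw_distance` and the input layout are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two multi-dimensional sequences.

    `a` and `b` are lists of equal-dimension vectors (one vector per time
    step). The two sequences may have different lengths.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

For example, `dtw_distance([[0, 0], [1, 1]], [[0, 0], [0, 0], [1, 1]])` is zero, because the warping path absorbs the repeated first sample; this tolerance to differences in pace is what makes DTW a natural fit for comparing a patient's movement with a reference exercise.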

arXiv:2408.06227 [pdf]

FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks

Authors: Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani

Abstract: This paper introduces FLEURS-R, a speech-restoration-applied version of the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) corpus. Like FLEURS, FLEURS-R is an N-way parallel speech corpus in 102 languages, with improved audio quality and fidelity obtained by applying the speech restoration model Miipher. The aim of FLEURS-R is to advance speech technology in more languages and catalyze research including text-to-speech (TTS) and other speech generation tasks in low-resource languages. Comprehensive evaluations with the restored speech and TTS baseline models trained from the new corpus show that the new corpus obtained significantly improved speech quality while maintaining the semantic contents of the speech. The corpus is publicly released via Hugging Face.

Submitted 12 August, 2024; originally announced August 2024.

Journal ref: INTERSPEECH 2024

arXiv:2407.05605 [pdf, other]

doi 10.1109/ICASSP43922.2022.9746163

Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection

Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Minglei Ma, Yingen Yang

Abstract: The automatic speaker verification system is sometimes vulnerable to various spoofing attacks. The 2-class Gaussian Mixture Model classifier for genuine and spoofed speech is usually used as the baseline for spoofing detection. However, the GMM classifier does not separately consider the scores of feature frames on each Gaussian component. In addition, the GMM accumulates the scores on all frames independently, and does not consider their correlations. We propose the two-path GMM-ResNet and GMM-SENet models for spoofing detection, whose input is the Gaussian probability features based on two GMMs trained on genuine and spoofed speech respectively. The models consider not only the score distribution on GMM components, but also the relationship between adjacent frames. A two-step training scheme is applied to improve the system robustness. Experiments on the ASVspoof 2019 show that the LFCC+GMM-ResNet system can relatively reduce min-tDCF and EER by 76.1% and 76.3% on logical access scenario compared with the GMM, and the LFCC+GMM-SENet system by 94.4% and 95.4% on physical access scenario. After score fusion, the systems give the second-best results on both scenarios.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.04408 [pdf, ps, other]

Hybrid Receiver Design for Massive MIMO-OFDM with Low-Resolution ADCs and Oversampling

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Italo Atzeni, Markku Juntti

Abstract: Low-resolution analog-to-digital converters (ADCs) and hybrid beamforming have emerged as efficient solutions to reduce power consumption with satisfactory spectral efficiency (SE) in massive multiple-input multiple-output (MIMO) systems. In this paper, we investigate the performance of a hybrid receiver in massive MIMO orthogonal frequency-division multiplexing (OFDM) uplink systems with low-resolution ADCs and oversampling. Considering both the temporal and spatial correlation of the quantization distortion (QD), we derive a closed-form approximation of the frequency-domain QD covariance matrix, which facilitates the evaluation of the system's SE. Then we jointly design the analog and digital combiners of the hybrid receiver to maximize the SE. The formulated problem is challenging due to the constant-modulus constraint of the analog combiner and its coupling with the digital one. To overcome these challenges, we transform the objective function into an equivalent but more tractable form and then iteratively update the analog and digital combiners. Numerical simulations verify the superiority of the proposed algorithm over the considered benchmarks and show the resilience of the hybrid receiver to beam squint with low-resolution ADCs. Furthermore, the proposed hybrid receiver design with oversampling can achieve significantly higher energy efficiency compared with the fully digital one.

Submitted 21 January, 2025; v1 submitted 5 July, 2024; originally announced July 2024.

Comments: 6 pages, 4 figures, to appear in WCNC 2025

arXiv:2407.03796 [pdf, ps, other]

Joint Beamforming Design and Bit Allocation in Massive MIMO with Resolution-Adaptive ADCs

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Italo Atzeni, Markku Juntti

Abstract: Low-resolution analog-to-digital converters (ADCs) have emerged as a promising technology for reducing power consumption and complexity in massive multiple-input multiple-output (MIMO) systems while maintaining satisfactory spectral and energy efficiencies (SE/EE). In this work, we first identify the essential properties of optimal quantization and leverage them to derive a closed-form approximation of the covariance matrix of the quantization distortion. The theoretical finding facilitates the system SE analysis in the presence of low-resolution ADCs. We then focus on the joint optimization of the transmit-receive beamforming and bit allocation to maximize the SE under constraints on the transmit power and the total number of active ADC bits. To solve the resulting mixed-integer problem, we first develop an efficient beamforming design for fixed ADC resolutions. Then, we propose a low-complexity heuristic algorithm to iteratively optimize the ADC resolutions and beamforming matrices. Numerical results for a $64 \times 64$ MIMO system demonstrate that the proposed design offers $6\%$ improvement in both SE and EE with $40\%$ fewer active ADC bits compared with the uniform bit allocation. Furthermore, we numerically show that receiving more data streams with low-resolution ADCs can achieve higher SE and EE compared to receiving fewer data streams with high-resolution ADCs.

Submitted 5 May, 2025; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: 15 pages, 13 figures

arXiv:2407.02170 [pdf, other]

doi 10.1109/ICASSP48485.2024.10447628

GMM-ResNet2: Ensemble of Group ResNet Networks for Synthetic Speech Detection

Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Yong Zhou, Minglei Ma

Abstract: Deep learning models are widely used for speaker recognition and spoofing speech detection. We propose the GMM-ResNet2 for synthetic speech detection. Compared with the previous GMM-ResNet model, GMM-ResNet2 has four improvements. Firstly, the different order GMMs have different capabilities to form smooth approximations to the feature distribution, and multiple GMMs are used to extract multi-scale Log Gaussian Probability features. Secondly, the grouping technique is used to improve the classification accuracy by exposing the group cardinality while reducing both the number of parameters and the training time. The final score is obtained by ensemble of all group classifier outputs using the averaging method. Thirdly, the residual block is improved by including one activation function and one batch normalization layer. Finally, an ensemble-aware loss function is proposed to integrate the independent loss functions of all ensemble members. On the ASVspoof 2019 LA task, the GMM-ResNet2 achieves a minimum t-DCF of 0.0227 and an EER of 0.79%. On the ASVspoof 2021 LA task, the GMM-ResNet2 achieves a minimum t-DCF of 0.2362 and an EER of 2.19%, representing relative reductions of 31.4% and 76.3% compared with the LFCC-LCNN baseline.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2405.18739 [pdf, other]

FlocOff: Data Heterogeneity Resilient Federated Learning with Communication-Efficient Edge Offloading

Authors: Mulei Ma, Chenyu Gong, Liekang Zeng, Yang Yang, Liantao Wu

Abstract: Federated Learning (FL) has emerged as a fundamental learning paradigm to harness massive data scattered at geo-distributed edge devices in a privacy-preserving way. Given the heterogeneous deployment of edge devices, however, their data are usually Non-IID, introducing significant challenges to FL including degraded training accuracy, intensive communication costs, and high computing complexity. Towards that, traditional approaches typically utilize adaptive mechanisms, which may suffer from scalability issues, increased computational overhead, and limited adaptability to diverse edge environments. To address that, this paper instead leverages the observation that the computation offloading involves inherent functionalities such as node matching and service correlation to achieve data reshaping and proposes Federated learning based on computing Offloading (FlocOff) framework, to address data heterogeneity and resource-constrained challenges. Specifically, FlocOff formulates the FL process with Non-IID data in edge scenarios and derives rigorous analysis on the impact of imbalanced data distribution. Based on this, FlocOff decouples the optimization in two steps, namely: (1) Minimizes the Kullback-Leibler (KL) divergence via Computation Offloading scheduling (MKL-CO); (2) Minimizes the Communication Cost through Resource Allocation (MCC-RA). Extensive experimental results demonstrate that the proposed FlocOff effectively improves model convergence and accuracy by 14.3%-32.7% while reducing data heterogeneity under various data distributions.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2404.06674 [pdf, other]

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

Authors: Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma

Abstract: We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io.

Submitted 11 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.04904 [pdf, other]

Cross-Domain Audio Deepfake Detection: Dataset and Analysis

Authors: Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang

Abstract: Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.

Submitted 20 September, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

arXiv:2403.02039 [pdf, other]

A Frequency-Domain Approach for Enhanced Performance and Task Flexibility in Finite-Time ILC

Authors: Max van Haren, Kentaro Tsurumoto, Masahiro Mae, Lennart Blanken, Wataru Ohnishi, Tom Oomen

Abstract: Iterative learning control (ILC) is capable of improving the tracking performance of repetitive control systems by utilizing data from past iterations. The aim of this paper is to achieve both task flexibility, which is often achieved by ILC with basis functions, and the performance of frequency-domain ILC, with an intuitive design procedure. The cost function of norm-optimal ILC is determined that recovers frequency-domain ILC, and consequently, the feedforward signal is parameterized in terms of basis functions and frequency-domain ILC. The resulting method has the performance and design procedure of frequency-domain ILC and the task flexibility of basis functions ILC, which are complementary to each other. Validation on a benchmark example confirms the capabilities of the framework.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.12482 [pdf, other]

SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech

Authors: Adam Sabra, Cyprian Wronka, Michelle Mao, Samer Hijazi

Abstract: As more speech technologies rely on a supervised deep learning approach with clean speech as the ground truth, a methodology to onboard said speech at scale is needed. However, this approach needs to minimize the dependency on human listening and annotation, only requiring a human-in-the-loop when needed. In this paper, we address this issue by outlining Speech Enhancement-based Curation Pipeline (SECP) which serves as a framework to onboard clean speech. This clean speech can then train a speech enhancement model, which can further refine the original dataset and thus close the iterative loop. By running two iterative rounds, we observe that enhanced output used as ground truth does not degrade model performance according to $\Delta_{PESQ}$, a metric used in this paper. We also show through comparative mean opinion score (CMOS) based subjective tests that the highest and lowest bound of refined data is perceptually better than the original data.

Submitted 19 February, 2024; originally announced February 2024.

Comments: Accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024

arXiv:2311.07613

A Physics-informed Machine Learning-based Control Method for Nonlinear Dynamic Systems with Highly Noisy Measurements

Authors: Mason Ma, Jiajie Wu, Chase Post, Tony Shi, Jingang Yi, Tony Schmitz, Hong Wang

Abstract: This study presents a physics-informed machine learning-based control method for nonlinear dynamic systems with highly noisy measurements. Existing data-driven control methods that use machine learning for system identification cannot effectively cope with highly noisy measurements, resulting in unstable control performance. To address this challenge, the present study extends current physics-informed machine learning capabilities for modeling nonlinear dynamics with control and integrates them into a model predictive control framework. To demonstrate the capability of the proposed method, we test and validate it with two noisy nonlinear dynamic systems: the chaotic Lorenz 3 system and a turning machine tool. Analysis of the results illustrates that the proposed method outperforms state-of-the-art benchmarks as measured by both modeling accuracy and control performance for nonlinear dynamic systems under high-noise conditions.

Submitted 22 March, 2025; v1 submitted 11 November, 2023; originally announced November 2023.

Comments: We completely redesigned and rewrote this paper. It will be a completely different paper with different title, author list, and content

arXiv:2311.00332 [pdf, other]

SDF4CHD: Generative Modeling of Cardiac Anatomies with Congenital Heart Defects

Authors: Fanwei Kong, Sascha Stocker, Perry S. Choi, Michael Ma, Daniel B. Ennis, Alison Marsden

Abstract: Congenital heart disease (CHD) encompasses a spectrum of cardiovascular structural abnormalities, often requiring customized treatment plans for individual patients. Computational modeling and analysis of these unique cardiac anatomies can improve diagnosis and treatment planning and may ultimately lead to improved outcomes. Deep learning (DL) methods have demonstrated the potential to enable efficient treatment planning by automating cardiac segmentation and mesh construction for patients with normal cardiac anatomies. However, CHDs are often rare, making it challenging to acquire sufficiently large patient cohorts for training such DL models. Generative modeling of cardiac anatomies has the potential to fill this gap via the generation of virtual cohorts; however, prior approaches were largely designed for normal anatomies and cannot readily capture the significant topological variations seen in CHD patients. Therefore, we propose a type- and shape-disentangled generative approach suitable to capture the wide spectrum of cardiac anatomies observed in different CHD types and synthesize differently shaped cardiac anatomies that preserve the unique topology for specific CHD types. Our DL approach represents generic whole heart anatomies with CHD type-specific abnormalities implicitly using signed distance fields (SDF) based on CHD type diagnosis, which conveniently captures divergent anatomical variations across different types and represents meaningful intermediate CHD states. To capture the shape-specific variations, we then learn invertible deformations to morph the learned CHD type-specific anatomies and reconstruct patient-specific shapes. Our approach has the potential to augment the image-segmentation pairs for rarer CHD types for cardiac segmentation and generate cohorts of CHD cardiac meshes for computational simulation.

Submitted 8 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.15407 [pdf, ps, other]

Finite-Time Adaptive Fuzzy Tracking Control for Nonlinear State Constrained Pure-Feedback Systems

Authors: Ju Wu, Tong Wang, Min Ma

Abstract: This paper investigates the finite-time adaptive fuzzy tracking control problem for a class of pure-feedback systems with full-state constraints. With the help of the Mean-Value Theorem, the pure-feedback nonlinear system is transformed into a strict-feedback case. By employing a finite-time-stable like function and a state transformation for the output tracking error, the output tracking error converges to a predefined set in a fixed finite interval. To tackle the problem of state constraints, integral Barrier Lyapunov functions are utilized to guarantee that the state variables remain within the prescribed constraints with feasibility check. Fuzzy logic systems are utilized to approximate the unknown nonlinear functions. In addition, all the signals in the closed-loop system are guaranteed to be semi-globally ultimately uniformly bounded. Finally, two simulation examples are given to show the effectiveness of the proposed control strategy.

Submitted 28 December, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: typos checked and corrected in 'Introduction'

arXiv:2310.08804 [pdf, other]

Spiking Semantic Communication for Feature Transmission with HARQ

Authors: Mengyang Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan

Abstract: In Collaborative Intelligence (CI), the Artificial Intelligence (AI) model is divided between the edge and the cloud, with intermediate features being sent from the edge to the cloud for inference. Several deep learning-based Semantic Communication (SC) models have been proposed to reduce feature transmission overhead and mitigate channel noise interference. Previous research has demonstrated that Spiking Neural Network (SNN)-based SC models exhibit greater robustness on digital channels compared to Deep Neural Network (DNN)-based SC models. However, the existing SNN-based SC models require fixed time steps, resulting in fixed transmission bandwidths that cannot be adaptively adjusted based on channel conditions. To address this issue, this paper introduces a novel SC model called SNN-SC-HARQ, which combines the SNN-based SC model with the Hybrid Automatic Repeat Request (HARQ) mechanism. SNN-SC-HARQ comprises an SNN-based SC model that supports the transmission of features at varying bandwidths, along with a policy model that determines the appropriate bandwidth. Experimental results show that SNN-SC-HARQ can dynamically adjust the bandwidth according to the channel conditions without performance loss.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2309.10567 [pdf, other]

Multimodal Modeling For Spoken Language Identification

Authors: Shikhar Bharadwaj, Min Ma, Shikhar Vashishth, Ankur Bapna, Sriram Ganapathy, Vera Axelrod, Siddharth Dalmia, Wei Han, Yu Zhang, Daan van Esch, Sandy Ritchie, Partha Talukdar, Jason Riesa

Abstract: Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.

Submitted 19 September, 2023; originally announced September 2023.

arXiv:2308.01317 [pdf]

ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders

Authors: Shawn Xu, Lin Yang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, Atilla Kiraly, Sahar Kazemzadeh, Zakkai Melamed, Jungyeon Park, Patricia Strachan, Yun Liu, Chuck Lau, Preeti Singh, Christina Chen, Mozziyar Etemadi, Sreenivasa Raju Kalidindi, Yossi Matias, Katherine Chou, Greg S. Corrado, Shravya Shetty, Daniel Tse, Shruthi Prabhakara, Daniel Golden, et al. (3 additional authors not shown)

Abstract: In this work, we present an approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, that leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of chest X-ray tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.

Submitted 7 September, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

arXiv:2308.00393 [pdf, other]

A Survey of Time Series Anomaly Detection Methods in the AIOps Domain

Authors: Zhenyu Zhong, Qiliang Fan, Jiacheng Zhang, Minghua Ma, Shenglin Zhang, Yongqian Sun, Qingwei Lin, Yuzhi Zhang, Dan Pei

Abstract: Internet-based services have seen remarkable success, generating vast amounts of monitored key performance indicators (KPIs) as univariate or multivariate time series. Monitoring and analyzing these time series are crucial for researchers, service operators, and on-call engineers to detect outliers or anomalies indicating service failures or significant events. Numerous advanced anomaly detection methods have emerged to address availability and performance issues. This review offers a comprehensive overview of time series anomaly detection in Artificial Intelligence for IT operations (AIOps), which uses AI capabilities to automate and optimize operational workflows. Additionally, it explores future directions for real-world and next-generation time-series anomaly detection based on recent advancements.

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.10982 [pdf, other]

MASR: Multi-label Aware Speech Representation

Authors: Anjali Raj, Shikhar Bharadwaj, Sriram Ganapathy, Min Ma, Shikhar Vashishth

Abstract: In recent years, speech representation learning has been constructed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the side-information that is often available for a given speech recording. In this paper, we propose MASR, a Multi-label Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of multiple external knowledge sources to enhance the utilization of meta-data information. The external knowledge sources are incorporated in the form of sample-level pair-wise similarity matrices that are useful in a hard-mining loss. A key advantage of the MASR framework is that it can be combined with any choice of SSL method. Using MASR representations, we perform evaluations on several downstream tasks such as language identification, speech recognition and other non-semantic tasks such as speaker and emotion recognition. In these experiments, we illustrate significant performance improvements for the MASR over other established benchmarks. We perform a detailed analysis on the language identification task to provide insights on how the proposed loss function enables the representations to separate closely related languages.

Submitted 25 September, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted at ASRU 2023

arXiv:2307.04327 [pdf]

Legal Decision-making for Highway Automated Driving

Authors: Xiaohan Ma, Wenhao Yu, Chengxiang Zhao, Changjun Wang, Wenhui Zhou, Guangming Zhao, Mingyue Ma, Weida Wang, Lin Yang, Rui Mu, Hong Wang, Jun Li

Abstract: Compliance with traffic laws is a fundamental requirement for human drivers on the road, and autonomous vehicles must adhere to traffic laws as well. However, current autonomous vehicles prioritize safety and collision avoidance primarily in their decision-making and planning, which will lead to misunderstandings and distrust from human drivers and may even result in accidents in mixed traffic flow. Therefore, ensuring the compliance of the autonomous driving decision-making system is essential for ensuring the safety of autonomous driving and promoting the widespread adoption of autonomous driving technology. To this end, the paper proposes a trigger-based layered compliance decision-making framework. This framework utilizes the decision intent at the highest level as a signal to activate an online violation monitor that identifies the type of violation committed by the vehicle. Then, a four-layer architecture for compliance decision-making is employed to generate compliant trajectories. Using this system, autonomous vehicles can detect and correct potential violations in real time, thereby enhancing safety and building public confidence in autonomous driving technology. Finally, the proposed method is evaluated on the DJI AD4CHE highway dataset under four typical highway scenarios: speed limit, following distance, overtaking, and lane-changing. The results indicate that the proposed method increases the vehicle's overall compliance rate from 13.85% to 84.46%, while reducing the proportion of active violations to 0%, demonstrating its effectiveness.

Submitted 9 July, 2023; originally announced July 2023.

Comments: 14 pages, 17 figures

arXiv:2306.17697 [pdf, ps, other]

doi 10.1109/SPAWC53906.2023.10304436

Analysis of Oversampling in Uplink Massive MIMO-OFDM with Low-Resolution ADCs

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Italo Atzeni, Markku Juntti

Abstract: Low-resolution analog-to-digital converters (ADCs) have emerged as an efficient solution for massive multiple-input multiple-output (MIMO) systems to reap high data rates with reasonable power consumption and hardware complexity. In this paper, we analyze the performance of oversampling in uplink massive MIMO orthogonal frequency-division multiplexing (MIMO-OFDM) systems with low-resolution ADCs. Considering both the temporal and spatial correlation of the quantization distortion, we derive an approximate closed-form expression of an achievable sum rate, which reveals how the oversampling ratio (OSR), the ADC resolution, and the signal-to-noise ratio (SNR) jointly affect the system performance. In particular, we demonstrate that oversampling can effectively improve the sum rate by mitigating the impact of the quantization distortion, especially at high SNR and with very low ADC resolution. Furthermore, we show that the considered low-resolution massive MIMO-OFDM system can achieve the same performance as the unquantized one when both the SNR and the OSR are sufficiently high. Numerical simulations confirm our analysis.

Submitted 9 November, 2024; v1 submitted 30 June, 2023; originally announced June 2023.

Comments: Appeared in IEEE SPAWC2023. This version corrects some symbol typos

Journal ref: 2023 IEEE 24th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC)

arXiv:2306.10232

Multi-Task Offloading via Graph Neural Networks in Heterogeneous Multi-access Edge Computing

Authors: Mulei Ma

Abstract: In the rapidly evolving field of Heterogeneous Multi-access Edge Computing (HMEC), efficient task offloading plays a pivotal role in optimizing system throughput and resource utilization. However, existing task offloading methods often fall short of adequately modeling the dependency topology relationships between offloaded tasks, which limits their effectiveness in capturing the complex interdependencies of task features. To address this limitation, we propose a task offloading mechanism based on Graph Neural Networks (GNN). Our modeling approach takes into account factors such as task characteristics, network conditions, and available resources at the edge, and embeds these captured features into the graph structure. By utilizing GNNs, our mechanism can capture and analyze the intricate relationships between task features, enabling a more comprehensive understanding of the underlying dependency topology. Through extensive evaluations in heterogeneous networks, our proposed algorithm improves 18.6%-53.8% over greedy and approximate algorithms in optimizing system throughput and resource utilization. Our experiments showcase the advantage of considering the intricate interplay of task features using GNN-based modeling.

Submitted 30 May, 2024; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: Insufficient completion, there are some errors in the current version

arXiv:2306.04374 [pdf, other]

Label Aware Speech Representation Learning For Language Identification

Authors: Shikhar Vashishth, Shikhar Bharadwaj, Sriram Ganapathy, Ankur Bapna, Min Ma, Wei Han, Vera Axelrod, Partha Talukdar

Abstract: Speech representation learning approaches for non-semantic tasks such as language recognition have either explored supervised embedding extraction methods using a classifier model or self-supervised representation learning approaches using raw data. In this paper, we propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task. This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet based objective function to incorporate language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the downstream task. The language recognition experiments are performed on two public datasets - FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-the-art systems on language identification. We also report an analysis of the robustness of the LASR approach to noisy/missing labels as well as its application to multi-lingual speech recognition tasks.

Submitted 7 June, 2023; originally announced June 2023.

Comments: Accepted at Interspeech 2023

arXiv:2305.15719 [pdf, other]

Efficient Neural Music Generation

Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

Abstract: Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audios of state-of-the-art quality while reducing 95.7% or 99.6% forward passes in MusicLM, respectively, for sampling 10s or 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2303.12466 [pdf, ps, other]

doi 10.1109/ICC45041.2023

Beam Squint Analysis and Mitigation via Hybrid Beamforming Design in THz Communications

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Markku Juntti

Abstract: We investigate the beam squint effect in uniform planar arrays (UPAs) and propose an efficient hybrid beamforming (HBF) design to mitigate the beam squint in multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) systems operating in the terahertz band. We first analyze the array gain and derive the closed-form beam squint ratio that characterizes the severity of the beam squint effect on UPAs. The effect is shown to be more severe with a higher fractional bandwidth, while it can be significantly mitigated when the shape of a UPA approaches a square. We then focus on the HBF design that maximizes the system spectral efficiency. The design problem is challenging due to the frequency-flat nature and hardware constraints of the analog beamformer. We overcome the challenges by proposing an efficient decoupling design in which the digital and analog beamformers admit closed-form solutions, which facilitate practical implementations. Numerical results validate our analysis and show that the proposed HBF design is robust to beam squint, and thus, it outperforms the state-of-the-art methods in wideband massive MIMO systems.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 6 pages, 7 figures, to appear in IEEE ICC 2023

arXiv:2303.03470 [pdf, other]

What Would Trojans Do? Exploiting Partial-Information Vulnerabilities in Autonomous Vehicle Sensing

Authors: R. Spencer Hallyburton, Qingzhao Zhang, Z. Morley Mao, Michael Reiter, Miroslav Pajic

Abstract: Safety-critical sensors in autonomous vehicles (AVs) form an essential part of the vehicle's trusted computing base (TCB), yet they are highly susceptible to attacks. Alarmingly, Tier 1 manufacturers have already exposed vulnerabilities to attacks introducing Trojans that can stealthily alter sensor outputs. We analyze the feasible capability and safety-critical outcomes of an attack on sensing at a cyber level. To further address these threats, we design realistic attacks in AV simulators and real-world datasets under two practical constraints: attackers (1) possess only partial information and (2) are constrained by data structures that maintain sensor integrity. Examining the role of camera and LiDAR in multi-sensor AVs, we find that attacks targeting only the camera have minimal safety impact due to the sensor fusion system's strong reliance on 3D data from LiDAR. This reliance makes LiDAR-based attacks especially detrimental to safety. To mitigate the vulnerabilities, we introduce security-aware sensor fusion incorporating (1) a probabilistic data-asymmetry monitor and (2) a scalable track-to-track fusion of 3D LiDAR and monocular detections (T2T-3DLM). We demonstrate that these methods significantly diminish attack success rate.

Submitted 13 March, 2025; v1 submitted 6 March, 2023; originally announced March 2023.

arXiv:2303.01723 [pdf, other]

AI-Empowered Hybrid MIMO Beamforming

Authors: Nir Shlezinger, Mengyuan Ma, Ortal Lavi, Nhan Thanh Nguyen, Yonina C. Eldar, Markku Juntti

Abstract: Hybrid multiple-input multiple-output (MIMO) is an attractive technology for realizing extreme massive MIMO systems envisioned for future wireless communications in a scalable and power-efficient manner. However, the fact that hybrid MIMO systems implement part of their beamforming in analog and part in digital makes the optimization of their beampattern notably more challenging compared with conventional fully digital MIMO. Consequently, recent years have witnessed a growing interest in using data-aided artificial intelligence (AI) tools for hybrid beamforming design. This article reviews candidate strategies to leverage data to improve real-time hybrid beamforming design. We discuss the architectural constraints and characterize the core challenges associated with hybrid beamforming optimization. We then present how these challenges are treated via conventional optimization, and identify different AI-aided design approaches. These can be roughly divided into purely data-driven deep learning models and different forms of deep unfolding techniques for combining AI with classical optimization. We provide a systematic comparative study between existing approaches including both numerical evaluations and qualitative measures. We conclude by presenting future research opportunities associated with the incorporation of AI in hybrid MIMO systems.

Submitted 3 March, 2023; originally announced March 2023.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2302.12041 [pdf, other]

Deep Unfolding Hybrid Beamforming Designs for THz Massive MIMO Systems

Authors: Nhan Thanh Nguyen, Mengyuan Ma, Nir Shlezinger, Yonina C. Eldar, A. L. Swindlehurst, Markku Juntti

Abstract: Hybrid beamforming (HBF) is a key enabler for wideband terahertz (THz) massive multiple-input multiple-output (mMIMO) communications systems. A core challenge with designing HBF systems stems from the fact that their application often involves a non-convex, highly complex optimization of large dimensions. In this paper, we propose HBF schemes that leverage data to enable efficient designs for both the fully-connected HBF (FC-HBF) and dynamic sub-connected HBF (SC-HBF) architectures. We develop a deep unfolding framework based on factorizing the optimal fully digital beamformer into analog and digital terms and formulating two corresponding equivalent least squares (LS) problems. Then, the digital beamformer is obtained via a closed-form LS solution, while the analog beamformer is obtained via ManNet, a lightweight sparsely-connected deep neural network based on unfolding projected gradient descent. Incorporating ManNet into the developed deep unfolding framework leads to the ManNet-based FC-HBF scheme. We show that the proposed ManNet can also be applied to SC-HBF designs after determining the connections between the radio frequency chain and antennas. We further develop a simplified version of ManNet, referred to as subManNet, that directly produces the sparse analog precoder for SC-HBF architectures. Both networks are trained with an unsupervised training procedure. Numerical results verify that the proposed ManNet/subManNet-based HBF approaches outperform the conventional model-based and deep unfolded counterparts with very low complexity and a fast run time. For example, in a simulation with 128 transmit antennas, it attains a slightly higher spectral efficiency than the Riemannian manifold scheme, while running over 1000 times faster and with a complexity reduction by a factor of more than six.

Submitted 23 February, 2023; originally announced February 2023.

Comments: This paper has been submitted to IEEE Transactions on Signal Processing

arXiv:2212.05751 [pdf, other]

Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network

Authors: Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, Yuxuan Wang

Abstract: The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity. AC enables a variety of applications, such as language learning, speech content creation, and data augmentation. Previous methods rely on reference utterances in the inference phase or are unable to preserve speaker identity. To address these issues, we propose a zero-shot reference-free accent conversion method, which is able to convert unseen speakers' utterances into a target accent. Pseudo Siamese Disentanglement Network (PSDN) is proposed to disentangle the accent from the content representation. Experimental results show that our model generates speech samples with much higher accentedness than the input and comparable naturalness, on two-way conversion including foreign-to-native and native-to-foreign.

Submitted 10 August, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: Accepted by INTERSPEECH 2023

arXiv:2210.06890 [pdf, ps, other]

Switch-based Hybrid Beamforming Transceiver Design for Wideband Communications with Beam Squint

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Markku Juntti

Abstract: Hybrid beamforming (HBF) transceiver architectures based on frequency-independent phase shifters (PS-HBF) are sensitive to the phases and physical directions with limited capability to compensate for the detrimental effects of the beam squint. Motivated by the fact that switches are phase-independent and more power/cost efficient than PSs, we consider the switch-based HBF (SW-HBF) for wideband large-scale multiple-input multiple-output systems in this paper. We first derive a closed-form expression of the beam squint ratio and compare the expected array gains of both SW-HBF and PS-HBF architectures. The results show that SW-HBF is more robust to the beam squint effect. We then focus on the SW-HBF designs to maximize the spectral efficiency (SE) in both single-user and multiuser systems, which are both non-convex mixed-integer problems. For the former, by combining the tabu search (TS) method and projected gradient ascent (PGA), we propose an efficient heuristic PGA-TS algorithm to design analog beamformers while the digital ones admit closed-form solutions. For the latter, we develop a two-step algorithm based on fractional programming and the PGA-TS method. Simulations show that the proposed SW-HBF schemes are efficient and can outperform PS-based HBF architectures in terms of both SE and energy efficiency in terahertz communication systems.

Submitted 26 September, 2024; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: 16 pages, 20 figures

arXiv:2210.06836 [pdf, other]

SNN-SC: A Spiking Semantic Communication Framework for Collaborative Intelligence

Authors: Mengyang Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan

Abstract: Collaborative Intelligence (CI) has emerged as a promising framework for deploying Artificial Intelligence (AI) models on resource-constrained edge devices. In CI, the AI model is partitioned between the edge device and the cloud, with intermediate features transmitted from the edge sub-model to the cloud sub-model to complete the inference task. However, reducing feature transmission overhead while maintaining task performance remains a challenge, particularly in the case of noisy wireless channels. In this paper, we propose a Spiking Neural Network (SNN)-based Semantic Communication (SC) model, SNN-SC, which extracts compact semantic information from features and transmits it through digital binary channels. Compared to the Deep Neural Network (DNN)-based SC model, whose output is floating-point, the binary output of SNN makes SNN-SC directly applicable to digital binary channels without the need for extra quantization. Moreover, we introduce a novel spiking neuron called IHF to enhance the reconstruction capability of the SNN-SC decoder. Finally, we enhance the performance of SNN-SC by maximizing the entropy of semantic information. SNN-SC achieves a higher compression ratio and overcomes the 'cliff effect' compared to the traditional separate source and channel coding method. In addition, SNN-SC has lower computational complexity than the DNN-based SC model and maintains higher task performance under poor channel conditions.

Submitted 22 November, 2024; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: Accepted for publication in the IEEE Transactions on Vehicular Technology

arXiv:2210.06747 [pdf, other]

DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation

Authors: Lizhi Bai, Jun Yang, Chunqi Tian, Yaoru Sun, Maoyu Mao, Yanjun Xu, Weirong Xu

Abstract: Combining RGB images and the corresponding depth maps in semantic segmentation has proven effective in the past few years. Existing RGB-D modal fusion methods either lack the non-linear feature fusion ability or treat both modal images equally, regardless of the intrinsic distribution gap or information loss. Here we find that depth maps are suitable to provide intrinsic fine-grained patterns of objects due to their local depth continuity, while RGB images effectively provide a global view. Based on this, we propose a pixel differential convolution attention (DCA) module to consider geometric information and local-range correlations for depth data. Furthermore, we extend DCA to ensemble differential convolution attention (EDCA) which propagates long-range contextual dependencies and seamlessly incorporates spatial distribution for RGB data. DCA and EDCA dynamically adjust convolutional weights by pixel difference to enable self-adaptation in local and long range, respectively. A two-branch network built with DCA and EDCA, called Differential Convolutional Network (DCANet), is proposed to fuse local and global information of two-modal data. Consequently, the individual advantages of RGB and depth data are emphasized. Our DCANet is shown to set a new state-of-the-art performance for RGB-D semantic segmentation on two challenging benchmark datasets, i.e., NYUDv2 and SUN-RGBD.

Submitted 13 October, 2022; originally announced October 2022.

arXiv:2208.02792 [pdf]

A Cooperative Perception Environment for Traffic Operations and Control

Authors: Hanlin Chen, Brian Liu, Xumiao Zhang, Feng Qian, Z. Morley Mao, Yiheng Feng

Abstract: Existing data collection methods for traffic operations and control usually rely on infrastructure-based loop detectors or probe vehicle trajectories. Connected and automated vehicles (CAVs) not only can report data about themselves but also can provide the status of all detected surrounding vehicles. Integration of perception data from multiple CAVs as well as infrastructure sensors (e.g., LiDAR) can provide richer information even under a very low penetration rate. This paper aims to develop a cooperative data collection system, which integrates LiDAR point cloud data from both infrastructure and CAVs to create a cooperative perception environment for various transportation applications. The state-of-the-art 3D detection models are applied to detect vehicles in the merged point cloud. We test the proposed cooperative perception environment with the max pressure adaptive signal control model in a co-simulation platform with CARLA and SUMO. Results show that very low penetration rates of CAV plus an infrastructure sensor are sufficient to achieve comparable performance with 30% or higher penetration rates of connected vehicles (CV). We also show the equivalent CV penetration rate (E-CVPR) under different CAV penetration rates to demonstrate the data collection efficiency of the cooperative perception environment.

Submitted 4 August, 2022; originally announced August 2022.

arXiv:2206.07008 [pdf, other]

doi 10.1109/LSP.2022.3184251

Constellation Design for Deep Joint Source-Channel Coding

Authors: Mengyang Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan

Abstract: Deep learning-based joint source-channel coding (JSCC) has shown excellent performance in image and feature transmission. However, the output values of the JSCC encoder are continuous, which makes the constellation of modulation complex and dense. It is hard and expensive to design radio frequency chains for transmitting such full-resolution constellation points. In this paper, two methods of mapping the full-resolution constellation to a finite constellation are proposed for real system implementation. The constellation mapping results of the proposed methods correspond to a regular constellation and an irregular constellation, respectively. We apply the methods to existing deep JSCC models and evaluate them on AWGN channels with different signal-to-noise ratios (SNRs). Experimental results show that the proposed methods outperform the traditional uniform quadrature amplitude modulation (QAM) constellation mapping method by only adding a few additional parameters.

Submitted 7 June, 2022; originally announced June 2022.

arXiv:2205.12446 [pdf, other]

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Authors: Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna

Abstract: We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.

Submitted 24 May, 2022; originally announced May 2022.

arXiv:2205.03524 [pdf, other]

Dual Adversarial Adaptation for Cross-Device Real-World Image Super-Resolution

Authors: Xiaoqian Xu, Pengxu Wei, Weikai Chen, Mingzhi Mao, Liang Lin, Guanbin Li

Abstract: Due to the sophisticated imaging process, an identical scene captured by different cameras could exhibit distinct imaging patterns, introducing distinct proficiency among the super-resolution (SR) models trained on images from different devices. In this paper, we investigate a novel and practical task coded cross-device SR, which strives to adapt a real-world SR model trained on the paired images captured by one camera to low-resolution (LR) images captured by arbitrary target devices. The proposed task is highly challenging due to the absence of paired data from various imaging devices. To address this issue, we propose an unsupervised domain adaptation mechanism for real-world SR, named Dual ADversarial Adaptation (DADA), which only requires LR images in the target domain with available real paired data from a source camera. DADA employs the Domain-Invariant Attention (DIA) module to establish the basis of target model training even without HR supervision. Furthermore, the dual framework of DADA facilitates an Inter-domain Adversarial Adaptation (InterAA) in one branch for two LR input images from two domains, and an Intra-domain Adversarial Adaptation (IntraAA) in two branches for an LR input image. InterAA and IntraAA together improve the model transferability from the source domain to the target. We empirically conduct experiments under six Real to Real adaptation settings among three different cameras, and achieve superior performance compared with existing state-of-the-art approaches. We also evaluate the proposed DADA to address the adaptation to the video camera, which presents a promising research topic to promote the wide applications of real-world super-resolution. Our source code is publicly available at https://github.com/lonelyhope/DADA.git.

Submitted 6 May, 2022; originally announced May 2022.

Showing 1–50 of 83 results for author: Mao, M