Search | arXiv e-print repository

Perceptual Ratings Predict Speech Inversion Articulatory Kinematics in Childhood Speech Sound Disorders

Authors: Nina R. Benway, Saba Tabatabaee, Dongliang Wang, Benjamin Munson, Jonathan L. Preston, Carol Espy-Wilson

Abstract: Purpose: This study evaluated whether articulatory kinematics, inferred by Articulatory Phonology speech inversion neural networks, aligned with perceptual ratings of /r/ and /s/ in the speech of children with speech sound disorders. Methods: Articulatory Phonology vocal tract variables were inferred for 5,961 utterances from 118 children and 3 adults, aged 2.25-45 years. Perceptual ratings were standardized using the novel 5-point PERCEPT Rating Scale and training protocol. Two research questions examined if the articulatory patterns of inferred vocal tract variables aligned with the perceptual error category for the phones investigated (e.g., tongue tip is more anterior in dentalized /s/ productions than in correct /s/). A third research question examined if gradient PERCEPT Rating Scale scores predicted articulatory proximity to correct productions. Results: Estimated marginal means from linear mixed models supported 17 of 18 /r/ hypotheses, involving tongue tip and tongue body constrictions. For /s/, estimated marginal means from a second linear mixed model supported 7 of 15 hypotheses, particularly those related to the tongue tip. A third linear mixed model revealed that PERCEPT Rating Scale scores significantly predicted articulatory proximity of errored phones to correct productions. Conclusion: Inferred vocal tract variables differentiated category and magnitude of articulatory errors for /r/, and to a lesser extent for /s/, aligning with perceptual judgments. These findings support the clinical interpretability of speech inversion vocal tract variables and the PERCEPT Rating Scale in quantifying articulatory proximity to the target sound, particularly for /r/.

Submitted 2 July, 2025; originally announced July 2025.

Comments: This manuscript is in submission for publication. It has not yet been peer reviewed

arXiv:2506.15125 [pdf, ps, other]

Fiber Signal Denoising Algorithm using Hybrid Deep Learning Networks

Authors: Linlin Wang, Wei Wang, Dezhao Wang, Shanwen Wang

Abstract: With the applicability of optical fiber-based distributed acoustic sensing (DAS) systems, effective signal processing and analysis approaches are needed to promote their popularization in the field of intelligent transportation systems (ITS). This paper presents a signal denoising algorithm using a hybrid deep-learning network (HDLNet). Without annotated data and time-consuming labeling, this self-supervised network runs in parallel, combining an autoencoder for denoising (DAE) and a long short-term memory (LSTM) network for sequential processing. Additionally, a line-by-line matching algorithm for vehicle detection and tracking is introduced, thus realizing the complete processing of fiber signal denoising and feature extraction. Experiments were carried out on a self-established real highway tunnel dataset, showing that our proposed hybrid network yields more satisfactory denoising performance than Spatial-domain DAE.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 15 pages, 10 figures

arXiv:2506.11540 [pdf, ps, other]

MMWiLoc: A Multi-Sensor Dataset and Robust Device-Free Localization Method Using Commercial Off-The-Shelf Millimeter Wave Wi-Fi Devices

Authors: Wenbo Ding, Yang Li, Dongsheng Wang, Bin Zhao, Yunrong Zhu, Yibo Zhang, Yumeng Miao

Abstract: Device-free Wi-Fi sensing has numerous benefits in practical settings, as it eliminates the requirement for dedicated sensing devices and can be accomplished using current low-cost Wi-Fi devices. With the development of Wi-Fi standards, millimeter wave Wi-Fi devices with 60GHz operating frequency and up to 4GHz bandwidth have become commercially available. Although millimeter wave Wi-Fi presents great promise for Device-Free Wi-Fi sensing with increased bandwidth and beam-forming ability, a method for localization using millimeter wave Wi-Fi is still lacking. Here, we present two major contributions: First, we provide a comprehensive multi-sensor dataset that synchronously captures human movement data from millimeter wave Wi-Fi, 2.4GHz Wi-Fi, and millimeter wave radar sensors. This dataset enables direct performance comparisons across different sensing modalities and facilitates reproducible research in indoor localization. Second, we introduce MMWiLoc, a novel localization method that achieves centimeter-level precision with low computational cost. MMWiLoc incorporates two components: beam pattern calibration using Expectation Maximization and target localization through Multi-Scale Compression Sensing. The system processes beam Signal-to-Noise Ratio (beamSNR) information from the beam-forming process to determine target Angle of Arrival (AoA), which is then fused across devices for localization. Our extensive evaluation demonstrates that MMWiLoc achieves centimeter-level precision, outperforming 2.4GHz Wi-Fi systems while maintaining competitive performance with high-precision radar systems. The dataset and example processing code will be released after this paper is accepted at https://github.com/wowoyoho/MMWiLoc.

Submitted 13 June, 2025; originally announced June 2025.

Comments: 8 pages, 8 figures

arXiv:2506.09512 [pdf, ps, other]

A Survey on the Role of Artificial Intelligence and Machine Learning in 6G-V2X Applications

Authors: Donglin Wang, Anjie Qiu, Qiuheng Zhou, Hans D. Schotten

Abstract: The rapid advancement of Vehicle-to-Everything (V2X) communication is transforming Intelligent Transportation Systems (ITS), with 6G networks expected to provide ultra-reliable, low-latency, and high-capacity connectivity for Connected and Autonomous Vehicles (CAVs). Artificial Intelligence (AI) and Machine Learning (ML) have emerged as key enablers in optimizing V2X communication by enhancing network management, predictive analytics, security, and cooperative driving due to their outstanding performance across various domains, such as natural language processing and computer vision. This survey comprehensively reviews recent advances in AI and ML models applied to 6G-V2X communication. It focuses on state-of-the-art techniques, including Deep Learning (DL), Reinforcement Learning (RL), Generative Learning (GL), and Federated Learning (FL), with particular emphasis on developments from the past two years. Notably, AI, especially GL, has shown remarkable progress and emerging potential in enhancing the performance, adaptability, and intelligence of 6G-V2X systems. Despite these advances, a systematic summary of recent research efforts in this area remains lacking, which this survey aims to address. We analyze their roles in 6G-V2X applications, such as intelligent resource allocation, beamforming, intelligent traffic management, and security management. Furthermore, we explore the technical challenges, including computational complexity, data privacy, and real-time decision-making constraints, while identifying future research directions for AI-driven 6G-V2X development. This study aims to provide valuable insights for researchers, engineers, and policymakers working towards realizing intelligent, AI-powered V2X ecosystems in 6G communication.

Submitted 11 June, 2025; originally announced June 2025.

Comments: 7 pages, 1 figure

arXiv:2506.08038 [pdf, ps, other]

Joint Routing and Control Optimization in VANET

Authors: Chen Huang, Dingxuan Wang, Ronghui Hou

Abstract: In this paper, we introduce DynaRoute, an adaptive joint optimization framework for dynamic vehicular networks that simultaneously addresses platoon control and data transmission through trajectory-aware routing and safety-constrained vehicle coordination. DynaRoute guarantees continuous vehicle movement via platoon safety control while optimizing transmission paths through real-time trajectory prediction and ensuring reliable data. Our solution achieves three key objectives: (1) maintaining platoon stability through accurate data transmission, (2) enabling adaptive routing based on vehicle movement patterns, and (3) enhancing overall intelligent transportation system performance. DynaRoute requires predefined traffic models and adapts to dynamic network conditions using local vehicle state information. We present comprehensive simulation results demonstrating that DynaRoute maintains control and transmission performance in multiple complex scenarios while significantly improving throughput and reliability compared to traditional approaches.

Submitted 5 June, 2025; originally announced June 2025.

Comments: 11 pages; 10 figures

arXiv:2506.04779 [pdf, ps, other]

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Authors: Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng

Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench.

Submitted 5 June, 2025; originally announced June 2025.

Comments: MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench

arXiv:2506.02012 [pdf, other]

Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

Authors: Zehua Liu, Xiaolou Li, Li Guo, Lantian Li, Dong Wang

Abstract: Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to effectively utilize LLMs in VSR tasks remains unexplored. This paper systematically explores how to better leverage LLMs for VSR tasks and provides three key contributions: (1) Scaling Test: We study how the LLM size affects VSR performance, confirming a scaling law in the VSR task. (2) Context-Aware Decoding: We add contextual text to guide the LLM decoding, improving recognition accuracy. (3) Iterative Polishing: We propose iteratively refining LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that with these designs, the great potential of LLMs can be largely harnessed, leading to significant VSR performance improvement.

Submitted 27 May, 2025; originally announced June 2025.

arXiv:2506.02010 [pdf, other]

CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge

Authors: Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang

Abstract: This paper presents the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), which builds on CNVSRC 2023 to advance research in Chinese Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR). The challenge evaluates two test scenarios: reading in recording studios and Internet speech. CNVSRC 2024 uses the same datasets as its predecessor CNVSRC 2023, which involves CN-CVS for training and CNVSRC-Single/Multi for development and evaluation. However, CNVSRC 2024 introduced two key improvements: (1) a stronger baseline system, and (2) an additional dataset, CN-CVS2-P1, for open tracks to improve data volume and diversity. The new challenge has demonstrated several important innovations in data preprocessing, feature extraction, model design, and training strategies, further pushing the state-of-the-art in Chinese LVC-VSR. More details and resources are available at the official website.

Submitted 27 May, 2025; originally announced June 2025.

Comments: to be published in INTERSPEECH 2025

arXiv:2506.00885 [pdf, ps, other]

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

Authors: Leying Zhang, Yao Qian, Xiaofei Wang, Manthan Thakker, Dongmei Wang, Jianwei Yu, Haibin Wu, Yuxuan Hu, Jinyu Li, Yanmin Qian, Sheng Zhao

Abstract: Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model, eliminating the reliance on intermediate token representations. To better capture realistic conversational dynamics, we propose transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking strategies. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed. Notably, CoVoMix2 operates without requiring transcriptions for the prompt and supports controllable dialogue generation, including overlapping speech and precise timing control, demonstrating strong generalizability to real-world speech generation scenarios.

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2505.21805 [pdf, ps, other]

An Investigation on Speaker Augmentation for End-to-End Speaker Extraction

Authors: Zhenghai You, Zhenyu Zhou, Lantian Li, Dong Wang

Abstract: Target confusion, defined as occasional switching to non-target speakers, poses a key challenge for end-to-end speaker extraction (E2E-SE) systems. We argue that this problem is largely caused by the lack of generalizability and discrimination of the speaker embeddings, and introduce a simple yet effective speaker augmentation strategy to tackle the problem. Specifically, we propose a time-domain resampling and rescaling pipeline that alters speaker traits while preserving other speech properties. This generates a variety of pseudo-speakers to help establish a generalizable speaker embedding space, while the speaker-trait-specific augmentation creates hard samples that force the model to focus on genuine speaker characteristics. Experiments on WSJ0-2Mix and LibriMix show that our method mitigates the target confusion and improves extraction performance. Moreover, it can be combined with metric learning, another effective approach to address target confusion, leading to further gains.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.18533 [pdf, ps, other]

TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network

Authors: Xiaobin Rong, Dahan Wang, Qinwen Hu, Yushi Wang, Yuxiang Hu, Jing Lu

Abstract: Universal speech enhancement aims to handle input speech with different distortions and input formats. To tackle this challenge, we present TS-URGENet, a Three-Stage Universal, Robust, and Generalizable speech Enhancement Network. To address various distortions, the proposed system employs a novel three-stage architecture consisting of a filling stage, a separation stage, and a restoration stage. The filling stage mitigates packet loss by preliminarily filling lost regions under noise interference, ensuring signal continuity. The separation stage suppresses noise, reverberation, and clipping distortion to improve speech clarity. Finally, the restoration stage compensates for bandwidth limitation, codec artifacts, and residual packet loss distortion, refining the overall speech quality. Our proposed TS-URGENet achieved outstanding performance in the Interspeech 2025 URGENT Challenge, ranking 2nd in Track 1.

Submitted 24 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.02439 [pdf, ps, other]

ReeM: Ensemble Building Thermodynamics Model for Efficient HVAC Control via Hierarchical Reinforcement Learning

Authors: Yang Deng, Yaohui Liu, Rui Liang, Dafang Zhao, Donghua Xie, Ittetsu Taniguchi, Dan Wang

Abstract: The building thermodynamics model, which predicts real-time indoor temperature changes under potential HVAC (Heating, Ventilation, and Air Conditioning) control operations, is crucial for optimizing HVAC control in buildings. While pioneering studies have attempted to develop such models for various building environments, these models often require extensive data collection periods and rely heavily on expert knowledge, making the modeling process inefficient and limiting the reusability of the models. This paper explores a model ensemble perspective that utilizes existing developed models as base models to serve a target building environment, thereby providing accurate predictions while reducing the associated efforts. Given that building data streams are non-stationary and the number of base models may increase, we propose a Hierarchical Reinforcement Learning (HRL) approach to dynamically select and weight the base models. Our approach employs a two-tiered decision-making process: the high-level focuses on model selection, while the low-level determines the weights of the selected models. We thoroughly evaluate the proposed approach through offline experiments and an on-site case study, and the experimental results demonstrate the effectiveness of our method.

Submitted 5 May, 2025; originally announced May 2025.

arXiv:2504.17898 [pdf, other]

Material Identification Via RFID For Smart Shopping

Authors: David Wang, Derek Goh, Jiale Zhang

Abstract: Cashierless stores rely on computer vision and RFID tags to associate shoppers with items, but concealed items placed in backpacks, pockets, or bags create challenges for theft prevention. We introduce a system that turns existing RFID-tagged items into material sensors by exploiting how different containers attenuate and scatter RF signals. Using RSSI and phase angle, we trained a neural network to classify seven common containers. In a simulated retail environment, the model achieves 89% accuracy with one-second samples and 74% accuracy from single reads. Incorporating distance measurements, our system achieves 82% accuracy across 0.3-2 m tag-to-reader separations. When deployed at aisle or doorway choke points, the system can flag suspicious events in real time, prompting camera screening or staff intervention. By combining material identification with computer vision tracking, our system provides proactive loss prevention for cashierless retail while utilizing existing infrastructure.

Submitted 24 April, 2025; originally announced April 2025.

Comments: 5 pages, 7 figures

ACM Class: J.0; J.7; B.0

arXiv:2504.04969 [pdf, other]

Grouped Target Tracking and Seamless People Counting with a 24 GHz MIMO FMCW

Authors: Dingyang Wang, Sen Yuan, Alexander Yarovoy, Francesco Fioranelli

Abstract: The problem of radar-based tracking of groups of people moving together and counting their numbers in indoor environments is considered here. A novel processing pipeline to track groups of people moving together and count their numbers is proposed and validated. The pipeline is specifically designed to deal with frequent changes of direction and stop & go movements typical of indoor activities. The proposed approach combines a tracker with a classifier to count the number of grouped people; this uses both spatial features extracted from range-azimuth maps, and Doppler frequency features extracted with wavelet decomposition. Thus, the pipeline outputs over time both the location and number of people present. The proposed approach is verified with experimental data collected with a 24 GHz Frequency Modulated Continuous Wave (FMCW) radar. It is shown that the proposed method achieves 95.59% accuracy in counting the number of people, and a tracking metric OSPA of 0.338. Furthermore, the performance is analyzed as a function of different relevant variables such as feature combinations and scenarios.

Submitted 7 April, 2025; originally announced April 2025.

arXiv:2504.02402 [pdf, other]

EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling

Authors: Hao Yin, Shi Guo, Xu Jia, Xudong XU, Lu Zhang, Si Liu, Dong Wang, Huchuan Lu, Tianfan Xue

Abstract: When sound waves hit an object, they induce vibrations that produce high-frequency and subtle visual changes, which can be used for recovering the sound. Early studies always encounter trade-offs related to sampling rate, bandwidth, field of view, and the simplicity of the optical path. Recent advances in event camera hardware show good potential for its application in visual sound recovery, because of its superior ability in capturing high-frequency signals. However, existing event-based vibration recovery methods are still sub-optimal for sound recovery. In this work, we propose a novel pipeline for non-contact sound recovery, fully utilizing spatial-temporal information from the event stream. We first generate a large training set using a novel simulation pipeline. Then we design a network that leverages the sparsity of events to capture spatial information and uses Mamba to model long-term temporal information. Lastly, we train a spatial aggregation block to aggregate information from different locations to further improve signal quality. To capture event signals caused by sound waves, we also design an imaging system using a laser matrix to enhance the gradient and collect multiple data sequences for testing. Experimental results on synthetic and real-world data demonstrate the effectiveness of our method.

Submitted 3 April, 2025; originally announced April 2025.

Comments: Our project page: https://yyzq1.github.io/EvMic/

arXiv:2504.01519 [pdf, other]

Chain of Correction for Full-text Speech Recognition with Large Language Models

Authors: Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang

Abstract: Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) has gained increased attention due to its potential to correct errors across long contexts and address a broader spectrum of error types, including punctuation restoration and inverse text normalization. Nevertheless, many challenges persist, including issues related to stability, controllability, completeness, and fluency. To mitigate these challenges, this paper proposes the Chain of Correction (CoC) for full-text error correction with LLMs, which corrects errors segment by segment using pre-recognized text as guidance within a regular multi-turn chat format. The CoC also uses pre-recognized full text for context, allowing the model to better grasp global semantics and maintain a comprehensive overview of the entire content. Utilizing the open-sourced full-text error correction dataset ChFT, we fine-tune a pre-trained LLM to evaluate the performance of the CoC framework. Experimental results demonstrate that the CoC effectively corrects errors in full-text ASR outputs, significantly outperforming baseline and benchmark systems. We further analyze how to set the correction threshold to balance under-correction and over-rephrasing, extrapolate the CoC model on extremely long ASR outputs, and investigate whether other types of information can be employed to guide the error correction process.

Submitted 2 April, 2025; originally announced April 2025.

arXiv:2503.24313 [pdf]

1-Tb/s/λ Transmission over Record 10714-km AR-HCF

Authors: Dawei Ge, Siyuan Liu, Qiang Qiu, Peng Li, Qiang Guo, Yiqi Li, Dong Wang, Baoluo Yan, Mingqing Zuo, Lei Zhang, Dechao Zhang, Hu Shi, Jie Luo, Han Li, Zhangyuan Chen

Abstract: We present the first single-channel 1.001-Tb/s DP-36QAM-PCS recirculating transmission over 73 loops of 146.77-km ultra-low-loss & low-IMI DNANF-5 fiber, achieving a record transmission distance of 10,714.28 km.

Submitted 2 April, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

arXiv:2503.21491 [pdf, other]

Data-Driven Contact-Aware Control Method for Real-Time Deformable Tool Manipulation: A Case Study in the Environmental Swabbing

Authors: Siavash Mahmoudi, Amirreza Davar, Dongyi Wang

Abstract: Deformable Object Manipulation (DOM) remains a critical challenge in robotics due to the complexities of developing suitable model-based control strategies. Deformable Tool Manipulation (DTM) further complicates this task by introducing additional uncertainties between the robot and its environment. While humans effortlessly manipulate deformable tools using touch and experience, robotic systems struggle to maintain stability and precision. To address these challenges, we present a novel State-Adaptive Koopman LQR (SA-KLQR) control framework for real-time deformable tool manipulation, demonstrated through a case study in environmental swab sampling for food safety. This method leverages Koopman operator-based control to linearize nonlinear dynamics while adapting to state-dependent variations in tool deformation and contact forces. A tactile-based feedback system dynamically estimates and regulates the swab tool's angle, contact pressure, and surface coverage, ensuring compliance with food safety standards. Additionally, a sensor-embedded contact pad monitors force distribution to mitigate tool pivoting and deformation, improving stability during dynamic interactions. Experimental results validate the SA-KLQR approach, demonstrating accurate contact angle estimation, robust trajectory tracking, and reliable force regulation. The proposed framework enhances precision, adaptability, and real-time control in deformable tool manipulation, bridging the gap between data-driven learning and optimal control in robotic interaction tasks.

Submitted 27 March, 2025; originally announced March 2025.

Comments: Submitted for Journal Review

arXiv:2503.20274 [pdf, other]

Near-Field THz Bending Beamforming: A Convex Optimization Perspective

Authors: Aoran Liu, Weidong Mei, Peilan Wang, Dong Wang, Ya Fei Wu, Zhi Chen, Boyu Ning

Abstract: Terahertz (THz) communication systems suffer severe blockage issues, which may significantly degrade the communication coverage and quality. Bending beams, capable of adjusting their propagation direction to bypass obstacles, have recently emerged as a promising solution to resolve this issue by engineering the propagation trajectory of the beam. However, traditional bending beam generation methods rely heavily on the specific geometric properties of the propagation trajectory and can only achieve sub-optimal performance. In this paper, we propose a new and general bending beamforming method by adopting convex optimization techniques. In particular, we formulate the bending beamforming design as a max-min optimization problem, aiming to optimize the analog or digital transmit beamforming vector to maximize the minimum received signal power among all positions along the bending beam trajectory. However, the resulting problem is non-convex and difficult to solve optimally. To tackle this difficulty, we apply the successive convex approximation (SCA) technique to obtain a high-quality suboptimal solution. Numerical results show that our proposed bending beamforming method outperforms the traditional method and is robust to obstacles in the environment.

Submitted 7 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.18375 [pdf, other]

ALWNN Empowered Automatic Modulation Classification: Conquering Complexity and Scarce Sample Conditions

Authors: Yunhao Quan, Chuang Gao, Nan Cheng, Zhijie Zhang, Zhisheng Yin, Wenchao Xu, Danyang Wang

Abstract: In Automatic Modulation Classification (AMC), deep learning methods have shown remarkable performance, offering significant advantages over traditional approaches and demonstrating their vast potential. Nevertheless, they have notable drawbacks, particularly their high demands for storage, computational resources, and large-scale labeled data, which limit their practical application in real-world scenarios. To tackle this issue, this paper proposes an automatic modulation classification model based on the Adaptive Lightweight Wavelet Neural Network (ALWNN) and the few-shot framework (MALWNN). The ALWNN model, by integrating the adaptive wavelet neural network and depth separable convolution, reduces the number of model parameters and computational complexity. The MALWNN framework, using ALWNN as an encoder and incorporating prototype network technology, decreases the model's dependence on the quantity of samples. Simulation results indicate that this model performs remarkably well on mainstream datasets. Moreover, in terms of Floating Point Operations Per Second (FLOPS) and Normalized Multiply-Accumulate Complexity (NMACC), ALWNN significantly reduces computational complexity compared to existing methods. This is further validated by real-world system tests on USRP and Raspberry Pi platforms. Experiments with MALWNN show its superior performance in few-shot learning scenarios compared to other algorithms.

Submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.17886 [pdf, other]

Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition

Authors: Yufeng Yang, Hassan Taherian, Vahid Ahmadi Kalkhorani, DeLiang Wang

Abstract: Despite the tremendous success of automatic speech recognition (ASR) with the introduction of deep learning, its performance is still unsatisfactory in many real-world multi-talker scenarios. Speaker separation excels in separating individual talkers but, as a frontend, it introduces processing artifacts that degrade the ASR backend trained on clean speech. As a result, mainstream robust ASR systems train the backend on noisy speech to avoid processing artifacts. In this work, we propose to decouple the training of the speaker separation frontend and the ASR backend, with the latter trained on clean speech only. Our decoupled system achieves a 5.1% word error rate (WER) on the Libri2Mix dev/test sets, significantly outperforming other multi-talker ASR baselines. Its effectiveness is also demonstrated with the state-of-the-art 7.60%/5.74% WERs on 1-ch and 6-ch SMS-WSJ. Furthermore, on recorded LibriCSS, we achieve a speaker-attributed WER of 2.92%. These state-of-the-art results suggest that decoupling speaker separation and recognition is an effective approach to elevate robust multi-talker ASR.

Submitted 22 March, 2025; originally announced March 2025.

arXiv:2503.14185 [pdf, other]

AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation

Authors: Wuwei Huang, Dexin Wang, Deyi Xiong

Abstract: In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. We concatenate the acoustic state and target word embedding sequence and feed the concatenated sequence into subsequent blocks in the decoder. In order to model the deep interaction between acoustic states and target hidden states, a speech-text mixed attention sublayer is introduced to replace the conventional cross-attention network. Experimental results on two widely-used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.

Submitted 18 March, 2025; originally announced March 2025.

Comments: ACL 2021 Findings

arXiv:2503.13478 [pdf]

Advancing Highway Work Zone Safety: A Comprehensive Review of Sensor Technologies for Intrusion and Proximity Hazards

Authors: Ayenew Yihune Demeke, Moein Younesi Heravi, Israt Sharmin Dola, Youjin Jang, Chau Le, Inbae Jeong, Zhibin Lin, Danling Wang

Abstract: Highway work zones are critical areas where accidents frequently occur, often due to the proximity of workers to heavy machinery and ongoing traffic. With technological advancements in sensor technologies and the Internet of Things, promising solutions are emerging to address these safety concerns. This paper provides a systematic review of existing studies on the application of sensor technologies in enhancing highway work zone safety, particularly in preventing intrusion and proximity hazards. Following the Preferred Reporting Items for Systematic Review and Meta-Analyses (PRISMA) protocol, the review examines a broad spectrum of publications on various sensor technologies, including GPS, radar, laser, infrared, RFID, Bluetooth, ultrasonic, and infrared sensors, detailing their application in reducing intrusion and proximity incidents. The review also assesses these technologies in terms of their accuracy, range, power consumption, cost, and user-friendliness, with a specific emphasis on their suitability for highway work zones. The findings highlight the potential of sensor technologies to significantly enhance work zone safety. As there is a wide range of sensor technologies to choose from, the review also reveals that the selection of sensors for a particular application needs careful consideration of different factors. Finally, while sensor technologies offer promising solutions for enhancing highway work zone safety, their effective implementation requires comprehensive consideration of various factors beyond technological capabilities, including developing integrated, cost-effective, user-friendly, and secure systems, and creating regulatory frameworks to support the rapid development of these technologies.

Submitted 4 March, 2025; originally announced March 2025.

Comments: 4 Figures, 5 Tables

arXiv:2503.13257 [pdf, other]

Anatomically and Metabolically Informed Diffusion for Unified Denoising and Segmentation in Low-Count PET Imaging

Authors: Menghua Xia, Kuan-Yin Ko, Der-Shiun Wang, Ming-Kai Chen, Qiong Liu, Huidong Xie, Liang Guo, Wei Ji, Jinsong Ouyang, Reimund Bayerlein, Benjamin A. Spencer, Quanzheng Li, Ramsey D. Badawi, Georges El Fakhri, Chi Liu

Abstract: Positron emission tomography (PET) image denoising, along with lesion and organ segmentation, are critical steps in PET-aided diagnosis. However, existing methods typically treat these tasks independently, overlooking inherent synergies between them as correlated steps in the analysis pipeline. In this work, we present the anatomically and metabolically informed diffusion (AMDiff) model, a unified framework for denoising and lesion/organ segmentation in low-count PET imaging. By integrating multi-task functionality and exploiting the mutual benefits of these tasks, AMDiff enables direct quantification of clinical metrics, such as total lesion glycolysis (TLG), from low-count inputs. The AMDiff model incorporates a semantic-informed denoiser based on diffusion strategy and a denoising-informed segmenter utilizing nnMamba architecture. The segmenter constrains denoised outputs via a lesion-organ-specific regularizer, while the denoiser enhances the segmenter by providing enriched image information through a denoising revision module. These components are connected via a warming-up mechanism to optimize multitask interactions. Experiments on multi-vendor, multi-center, and multi-noise-level datasets demonstrate the superior performance of AMDiff. For test cases below 20% of the clinical count levels from participating sites, AMDiff achieves TLG quantification biases of -26.98%, outperforming its ablated versions which yield biases of -35.85% (without the lesion-organ-specific regularizer) and -40.79% (without the denoising revision module).

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.12840 [pdf, other]

Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics

Authors: Chen Liu, Liying Yang, Peike Li, Dadong Wang, Lincheng Li, Xin Yu

Abstract: Sound-guided object segmentation has drawn considerable attention for its potential to enhance multimodal perception. Previous methods primarily focus on developing advanced architectures to facilitate effective audio-visual interactions, without fully addressing the inherent challenges posed by the nature of audio, i.e., (1) feature confusion due to the overlapping nature of audio signals, and (2) audio-visual matching difficulty from the varied sounds produced by the same object. To address these challenges, we propose Dynamic Derivation and Elimination (DDESeg): a novel audio-visual segmentation framework. Specifically, to mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source, deriving representations that preserve the unique characteristics of each sound. To reduce the matching difficulty, we introduce a discriminative feature learning module, which enhances the semantic distinctiveness of generated audio representations. Considering that not all derived audio representations directly correspond to visual features (e.g., off-screen sounds), we propose a dynamic elimination module to filter out non-matching elements. This module facilitates targeted interaction between sounding regions and relevant audio semantics. By scoring the interacted features, we identify and filter out irrelevant audio information, ensuring accurate audio-visual alignment. Comprehensive experiments demonstrate that our framework achieves superior performance on AVS datasets.

Submitted 17 March, 2025; originally announced March 2025.

Comments: Accepted by CVPR2025

arXiv:2503.08134 [pdf, other]

THz Beam Squint Mitigation via 3D Rotatable Antennas

Authors: Yike Xie, Weidong Mei, Dong Wang, Boyu Ning, Zhi Chen, Jun Fang, Wei Guo

Abstract: Analog beamforming holds great potential for future terahertz (THz) communications due to its ability to generate high-gain directional beams with low-cost phase shifters. However, conventional analog beamforming may suffer substantial performance degradation in wideband systems due to the beam-squint effects. Instead of relying on high-cost true time delayers, we propose in this paper an efficient three-dimensional (3D) rotatable antenna technology to mitigate the beam-squint effects, motivated by the fact that beam squint disappears along the boresight direction. In particular, we focus on a wideband wide-beam coverage problem, aiming to maximize the minimum beamforming gain within a given angle and frequency range by jointly optimizing the analog beamforming vector and the 3D rotation angles of the antenna array. However, this problem is non-convex and difficult to solve optimally due to the coupling of the spatial and frequency domains and that of the antenna weights and rotation. To tackle this issue, we first reformulate the problem into an equivalent form by merging the spatial and frequency domains into a single composite domain. Next, we combine alternating optimization (AO) and successive convex approximation (SCA) algorithms to optimize the analog beamforming and rotation angles within this composite domain. Simulation results demonstrate that the proposed scheme can significantly outperform conventional schemes without antenna rotation, thus offering a cost-effective solution for wideband transmission over THz bands.

Submitted 11 March, 2025; originally announced March 2025.

arXiv:2503.07997 [pdf, ps, other]

A Survey of Challenges and Sensing Technologies in Autonomous Retail Systems

Authors: Shimmy Rukundo, David Wang, Front Wongnonthawitthaya, Youssouf Sidibé, Minsik Kim, Emily Su, Jiale Zhang

Abstract: Autonomous stores leverage advanced sensing technologies to enable cashier-less shopping, real-time inventory tracking, and seamless customer interactions. However, these systems face significant challenges, including occlusion in vision-based tracking, scalability of sensor deployment, theft prevention, and real-time data processing. To address these issues, researchers have explored multi-modal sensing approaches, integrating computer vision, RFID, weight sensing, vibration-based detection, and LiDAR to enhance accuracy and efficiency. This survey provides a comprehensive review of sensing technologies used in autonomous retail environments, highlighting their strengths, limitations, and integration strategies. We categorize existing solutions across inventory tracking, environmental monitoring, people-tracking, and theft detection, discussing key challenges and emerging trends. Finally, we outline future directions for scalable, cost-efficient, and privacy-conscious autonomous store systems.

Submitted 10 March, 2025; originally announced March 2025.

ACM Class: J.0; J.7; A.1

arXiv:2503.02769 [pdf, ps, other]

InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training

Authors: Dingdong Wang, Jin Xu, Ruihang Chu, Zhifang Guo, Xiong Wang, Jincenzi Wu, Dongchao Yang, Shengpeng Ji, Junyang Lin

Abstract: Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models significantly diminishes when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency between speech and text representations through techniques such as representation and behavior alignment, which involve the meticulous design of data pairs during the post-training phase. In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Consequently, the model acquires the ability to generate textual continuations corresponding to the provided speech segments, obviating the need for intensive data design endeavors. To systematically evaluate speech instruction-following capabilities, we introduce SpeechInstructBench, the first comprehensive benchmark specifically designed for speech-oriented instruction-following tasks. Our proposed InSerter achieves SOTA performance in SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.

Submitted 4 June, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

Comments: Accepted to ACL 2025; Data is available at: https://huggingface.co/datasets/ddwang2000/SpeechInstructBench

arXiv:2503.02647 [pdf, other]

A Framework for Uplink ISAC Receiver Designs: Performance Analysis and Algorithm Development

Authors: Zhiyuan Yu, Hong Ren, Cunhua Pan, Gui Zhou, Dongming Wang, Chau Yuen, Jiangzhou Wang

Abstract: Uplink integrated sensing and communication (ISAC) systems have recently emerged as a promising research direction, enabling simultaneous uplink signal detection and target sensing. In this paper, we propose the flexible projection (FP)-type receiver that unifies the projection-type receiver and the successive interference cancellation (SIC)-type receiver by using a flexible tradeoff factor to adapt to dynamically changing uplink ISAC scenarios. The FP-type receiver addresses the joint signal detection and target response estimation problem through two coordinated phases: 1) communication signal detection using a reconstructed signal whose composition is controlled by the tradeoff factor, followed by 2) target response estimation performed through subtraction of the detected communication signal from the received signal. With adjustable tradeoff factors, the FP-type receiver can balance the enhancement of the signal-to-interference-plus-noise ratio (SINR) with the reduction of correlation in the reconstructed signal for communication signal detection. The pairwise error probabilities (PEPs) are analyzed for both the maximum likelihood (ML) and the zero-forcing (ZF) detectors, revealing that the optimal tradeoff factor should be determined based on the adopted detection algorithm and the relative power of the sensing and communication (S&C) signal. A homotopy optimization framework is first applied for the FP-type receiver with a fixed tradeoff factor. This framework is then extended to develop the dynamic FP (DFP)-type receiver, which iteratively adjusts the tradeoff factor for improved algorithm performance and environmental adaptability. Subsequently, two extensions are explored to further enhance the receiver's performance: a parallel DFP (PDFP)-type receiver and a block-structured receiver design. Finally, the effectiveness of the proposed receiver designs is verified via simulations.

Submitted 3 April, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

Comments: 13 pages, 9 figures, submitted to an IEEE journal for possible publication

arXiv:2503.00340 [pdf, other]

UL-UNAS: Ultra-Lightweight U-Nets for Real-Time Speech Enhancement via Network Architecture Search

Authors: Xiaobin Rong, Dahan Wang, Yuxiang Hu, Changbao Zhu, Kai Chen, Jing Lu

Abstract: Lightweight models are essential for real-time speech enhancement applications. In recent years, there has been a growing trend toward developing increasingly compact models for speech enhancement. In this paper, we propose an Ultra-Lightweight U-net optimized by Network Architecture Search (UL-UNAS), which is suitable for implementation in low-footprint devices. Firstly, we explore the application of various efficient convolutional blocks within the U-Net framework to identify the most promising candidates. Secondly, we introduce two boosting components to enhance the capacity of these convolutional blocks: a novel activation function named affine PReLU and a causal time-frequency attention module. Furthermore, we leverage neural architecture search to discover an optimal architecture within our carefully designed search space. By integrating the above strategies, UL-UNAS not only significantly outperforms the latest ultra-lightweight models with the same or lower computational complexity, but also delivers competitive performance compared to recent baseline models that require substantially higher computational resources.

Submitted 28 February, 2025; originally announced March 2025.

Comments: 13 pages, 8 figures, submitted to Neural Networks

arXiv:2502.14224 [pdf, other]

Adaptive Convolution for CNN-based Speech Enhancement Models

Authors: Dahan Wang, Xiaobin Rong, Shiruo Sun, Yuxiang Hu, Changbao Zhu, Jing Lu

Abstract: Deep learning-based speech enhancement methods have significantly improved speech quality and intelligibility. Convolutional neural networks (CNNs) have been proven to be essential components of many high-performance models. In this paper, we introduce adaptive convolution, an efficient and versatile convolutional module that enhances the model's capability to adaptively represent speech signals. Adaptive convolution performs frame-wise causal dynamic convolution, generating time-varying kernels for each frame by assembling multiple parallel candidate kernels. A lightweight attention mechanism leverages both current and historical information to assign adaptive weights to each candidate kernel, guiding their aggregation. This enables the convolution operation to adapt to frame-level speech spectral features, leading to more efficient extraction and reconstruction. Experimental results on various CNN-based models demonstrate that adaptive convolution significantly improves the performance with negligible increases in computational complexity, especially for lightweight models. Furthermore, we propose the adaptive convolutional recurrent network (AdaptCRN), an ultra-lightweight model that incorporates adaptive convolution and an efficient encoder-decoder design, achieving superior performance compared to models with similar or even higher computational costs.

Submitted 19 February, 2025; originally announced February 2025.

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2502.09631 [pdf, other]

Volumetric Temporal Texture Synthesis for Smoke Stylization using Neural Cellular Automata

Authors: Dongqing Wang, Ehsan Pajouheshgar, Yitao Xu, Tong Zhang, Sabine Süsstrunk

Abstract: Artistic stylization of 3D volumetric smoke data is still a challenge in computer graphics due to the difficulty of ensuring spatiotemporal consistency given a reference style image while staying within reasonable time and computational resources. In this work, we introduce Volumetric Neural Cellular Automata (VNCA), a novel model for efficient volumetric style transfer that synthesizes, in real-time, multi-view consistent stylizing features on the target smoke with temporally coherent transitions between stylized simulation frames. VNCA synthesizes a 3D texture volume with color and density stylization and dynamically aligns this volume with the intricate motion patterns of the smoke simulation under the Eulerian framework. Our approach replaces the explicit fluid advection modeling and the inter-frame smoothing terms with the self-emerging motion of the underlying cellular automaton, thus reducing the training time by over an order of magnitude. Beyond smoke simulations, we demonstrate the versatility of our approach by showcasing its applicability to mesh stylization.

Submitted 5 February, 2025; originally announced February 2025.

arXiv:2502.00421 [pdf, other]

Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language

Authors: Turi Abu, Ying Shi, Thomas Fang Zheng, Dong Wang

Abstract: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the Oromo language, which is underrepresented. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and a WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://github.com/turinaf/sagalee and we encourage its use for further research and development in Oromo speech processing.

Submitted 1 February, 2025; originally announced February 2025.

Comments: Accepted for ICASSP2025 (2025 IEEE International Conference on Acoustics, Speech, and Signal Processing)

arXiv:2501.15368 [pdf, other]

Baichuan-Omni-1.5 Technical Report

Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang , et al. (68 additional authors not shown)

Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.

Submitted 25 January, 2025; originally announced January 2025.

arXiv:2501.14234 [pdf, other]

STAR-RIS-Enabled Multi-Path Beam Routing with Passive Beam Splitting

Authors: Bonan An, Weidong Mei, Yuanwei Liu, Dong Wang, Zhi Chen

Abstract: Reconfigurable intelligent surfaces (RISs) can be densely deployed in the environment to create multi-reflection line-of-sight (LoS) links for signal coverage enhancement. However, conventional reflection-only RISs can only achieve half-space reflection, which limits the LoS path diversity. In contrast, simultaneously transmitting and reflecting RISs (STAR-RISs) can achieve full-space reflection and transmission, thereby creating more LoS paths. Hence, in this paper, we study a new multi-STAR-RIS-aided communication system, where a multi-antenna base station (BS) transmits to multiple single-antenna users by exploiting the signal beam routing over a set of cascaded LoS paths each formed by multiple STAR-RISs. To reveal essential insights, we first consider a simplified single-user case, aiming to maximize its received signal power by jointly optimizing the active beamforming at the BS, the BS's power allocation over different paths, the number of selected beam-routing paths, the selected STAR-RISs for each path, as well as their amplitude and phase shifts for transmission/reflection. However, this problem is difficult to solve optimally, as different paths may be intricately coupled at their shared STAR-RISs. To tackle this difficulty, we first derive the optimal solution to this problem in closed form for a given set of paths. The clique-based approach in graph theory is then applied to solve the remaining multi-path selection problem efficiently.
Next, we extend the proposed clique-based method to the multi-user case to maximize the minimum received signal power among all users, subject to additional constraints on the disjointness of the selected paths for different users. Simulation results show that our proposed STAR-RIS-enabled beam routing outperforms the conventional beam routing with reflection-only RISs in both single- and multi-user cases.

Submitted 19 May, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

arXiv:2501.13336 [pdf, other]

Gradient-Free Adversarial Purification with Diffusion Models

Authors: Xuelong Dai, Dong Wang, Duan Mingxing, Bin Xiao

Abstract: Adversarial training and adversarial purification are two effective and practical defense methods to enhance a model's robustness against adversarial attacks. However, adversarial training necessitates additional training, while adversarial purification suffers from low time efficiency. More critically, current defenses are designed under the perturbation-based adversarial threat model, which is ineffective against the recently proposed unrestricted adversarial attacks. In this paper, we propose an effective and efficient adversarial defense method that counters both perturbation-based and unrestricted adversarial attacks. Our defense is inspired by the observation that adversarial attacks are typically located near the decision boundary and are sensitive to pixel changes. To address this, we introduce adversarial anti-aliasing to mitigate adversarial modifications. Additionally, we propose adversarial super-resolution, which leverages prior knowledge from clean datasets to benignly recover images. These approaches do not require additional training and are computationally efficient without calculating gradients. Extensive experiments against both perturbation-based and unrestricted adversarial attacks demonstrate that our defense method outperforms state-of-the-art adversarial purification methods.

Submitted 22 January, 2025; originally announced January 2025.

arXiv:2412.20371 [pdf, other]

Cooperative ISAC-empowered Low-Altitude Economy

Authors: Jun Tang, Yiming Yu, Cunhua Pan, Hong Ren, Dongming Wang, Jiangzhou Wang, Xiaohu You

Abstract: This paper proposes a cooperative integrated sensing and communication (ISAC) scheme for the low-altitude sensing scenario, aiming at estimating the parameters of the unmanned aerial vehicles (UAVs) and enhancing the sensing performance via cooperation. The proposed scheme consists of two stages. In Stage I, we formulate the monostatic parameter estimation problem using a tensor decomposition model. By leveraging the Vandermonde structure of the factor matrix, a spatial smoothing tensor decomposition scheme is introduced to estimate the UAVs' parameters. To further reduce the computational complexity, we design a reduced-dimensional (RD) angle of arrival (AoA) estimation algorithm based on the generalized Rayleigh quotient (GRQ). In Stage II, the positions and true velocities of the UAVs are determined through data fusion across multiple base stations (BSs). Specifically, we first develop a false removing minimum spanning tree (MST)-based data association method to accurately match the BSs' parameter estimations to the same UAV. Then, a Pareto optimality method and a residual weighting scheme are developed to facilitate the position and velocity estimation, respectively. We further extend our approach to the dual-polarized system. Simulation results validate the effectiveness of the proposed schemes in comparison to the conventional techniques.

Submitted 29 December, 2024; originally announced December 2024.

arXiv:2412.20349 [pdf, other]

Two-Timescale Design for AP Mode Selection of Cooperative ISAC Networks

Authors: Zhichu Ren, Cunhua Pan, Hong Ren, Dongming Wang, Lexi Xu, Jiangzhou Wang

Abstract: As an emerging technology, cooperative bi-static integrated sensing and communication (ISAC) is promising to achieve high-precision sensing, high-rate communication as well as self-interference (SI) avoidance. This paper investigates the two-timescale design for access point (AP) mode selection to realize the full potential of the cooperative bi-static ISAC network with low system overhead, where the beamforming at the APs is adapted to the rapidly-changing instantaneous channel state information (CSI), while the AP mode is adapted to the slowly-changing statistical CSI. We first apply the minimum mean square error (MMSE) estimator to estimate the channel between the APs and the channels from the APs to the user equipments (UEs). Then we adopt the low-complexity maximum ratio transmission (MRT) beamforming and the maximum ratio combining (MRC) detector, and derive the closed-form expressions of the communication rate and the sensing signal-to-interference-plus-noise-ratio (SINR). We formulate a non-convex integer optimization problem to maximize the minimum sensing SINR under the communication quality of service (QoS) constraints. McCormick envelope relaxation and successive convex approximation (SCA) techniques are applied to solve the challenging non-convex integer optimization problem. Simulation results validate the closed-form expressions and prove the convergence and effectiveness of the proposed AP mode selection scheme.

Submitted 28 December, 2024; originally announced December 2024.

Comments: 13 pages, 8 figures

arXiv:2412.13891 [pdf, ps, other]

Graph-Driven Models for Gas Mixture Identification and Concentration Estimation on Heterogeneous Sensor Array Signals

Authors: Ding Wang, Lei Wang, Huilin Yin, Guoqing Gu, Zhiping Lin, Wenwen Zhang

Abstract: Accurately identifying gas mixtures and estimating their concentrations are crucial across various industrial applications using gas sensor arrays. However, existing models face challenges in generalizing across heterogeneous datasets, which limits their scalability and practical applicability. To address this problem, this study develops two novel deep-learning models that integrate temporal graph structures for enhanced performance: a Graph-Enhanced Capsule Network (GraphCapsNet) employing dynamic routing for gas mixture classification and a Graph-Enhanced Attention Network (GraphANet) leveraging self-attention for concentration estimation. Both models were validated on datasets from the University of California, Irvine (UCI) Machine Learning Repository and a custom dataset, demonstrating superior performance in gas mixture identification and concentration estimation compared to recent models. In classification tasks, GraphCapsNet achieved over 98.00% accuracy across multiple datasets, while in concentration estimation, GraphANet attained an R² score exceeding 0.96 across various gas components. Both GraphCapsNet and GraphANet exhibited significantly higher accuracy and stability, positioning them as promising solutions for scalable gas analysis in industrial settings.

Submitted 18 December, 2024; originally announced December 2024.

arXiv:2412.11614 [pdf]

Acceleration and Parallelization Methods for ISRS EGN Model

Authors: Ruiyang Xia, Guanjun Gao, Zanshan Zhao, Haoyu Wang, Kun Wen, Daobin Wang

Abstract: The enhanced Gaussian noise (EGN) model, which accounts for inter-channel stimulated Raman scattering (ISRS), has been extensively utilized for evaluating nonlinear interference (NLI) within the C+L band. Compared to closed-form expressions and machine learning-based NLI evaluation models, it demonstrates broader applicability, and its accuracy does not depend on the support of large-scale datasets. However, its high computational complexity often results in lengthy computation times. Through analysis, the high-frequency oscillations of the four-wave mixing (FWM) efficiency factor integrand were identified as a primary factor limiting the computational speed of the ISRS EGN model. To address this issue, we propose an accurate approximation method that enables the derivation of a closed-form expression for the FWM efficiency factor without imposing restrictive conditions. The scheme proposed in this paper could thereby significantly accelerate the computation. Numerical results demonstrate that the method in this work could achieve low error levels under high ISRS influence levels, with an MAE of less than 0.001 dB, and no cumulative error over increasing transmission distances, while reducing computation time by over 97%. Furthermore, a parallel computation strategy targeting independent regions within the integration domain is proposed, which could further improve computational efficiency by nearly 11 times.

Submitted 16 December, 2024; originally announced December 2024.

Comments: 12 pages, 12 figures, preprint submitted to IEEE for possible publication

arXiv:2412.10489 [pdf, other]

CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

Authors: Kaifan Zhang, Lihuo He, Xin Jiang, Wen Lu, Di Wang, Xinbo Gao

Abstract: Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their non-invasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable "beyond-image-modality" information embedded in EEG signals. This results in the loss of critical multimodal information in EEG. To address this limitation, we propose CognitionCapturer, a unified framework that fully leverages multimodal data to represent EEG signals. Specifically, CognitionCapturer trains Modality Expert Encoders for each modality to extract cross-modal information from the EEG modality. Then, it introduces a diffusion prior to map the EEG embedding space to the CLIP embedding space. Using a pretrained generative model, the proposed framework can then reconstruct visual stimuli with high semantic and structural fidelity. Notably, the framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities. Through extensive experiments, we demonstrate that CognitionCapturer outperforms state-of-the-art methods both qualitatively and quantitatively. Code: https://github.com/XiaoZhangYES/CognitionCapturer.

Submitted 24 December, 2024; v1 submitted 13 December, 2024; originally announced December 2024.

arXiv:2412.09887 [pdf, other]

CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls

Authors: Li Chai, Donglin Wang

Abstract: Lyric-to-melody generation is a highly challenging task in the field of AI music generation. Due to the difficulty of learning strict yet weak correlations between lyrics and melodies, previous methods have suffered from weak controllability, low-quality and poorly structured generation. To address these challenges, we propose CSL-L2M, a controllable song-level lyric-to-melody generation method based on an in-attention Transformer decoder with fine-grained lyric and musical controls, which is able to generate full-song melodies matched with the given lyrics and user-specified musical attributes. Specifically, we first introduce REMI-Aligned, a novel music representation that incorporates strict syllable- and sentence-level alignments between lyrics and melodies, facilitating precise alignment modeling. Subsequently, sentence-level semantic lyric embeddings independently extracted from a sentence-wise Transformer encoder are combined with word-level part-of-speech embeddings and syllable-level tone embeddings as fine-grained controls to enhance the controllability of lyrics over melody generation. Then we introduce human-labeled musical tags, sentence-level statistical musical attributes, and learned musical features extracted from a pre-trained VQ-VAE as coarse-grained, fine-grained and high-fidelity controls, respectively, to the generation process, thereby enabling user control over melody generation.
Finally, an in-attention Transformer decoder technique is leveraged to exert fine-grained control over the full-song melody generation with the aforementioned lyric and musical conditions. Experimental results demonstrate that our proposed CSL-L2M outperforms the state-of-the-art models, generating melodies with higher quality, better controllability and enhanced structure. Demos and source code are available at https://lichaiustc.github.io/CSL-L2M/.

Submitted 14 January, 2025; v1 submitted 13 December, 2024; originally announced December 2024.

Comments: Accepted at AAAI-25

arXiv:2411.13288 [pdf]

EEG Signal Denoising Using pix2pix GAN: Enhancing Neurological Data Analysis

Authors: Haoyi Wang, Xufang Chen, Yue Yang, Kewei Zhou, Meining Lv, Dongrui Wang, Wenjie Zhang

Abstract: Electroencephalography (EEG) is essential in neuroscience and clinical practice, yet it suffers from physiological artifacts, particularly electromyography (EMG), which distort signals. We propose a deep learning model using pix2pixGAN to remove such noise and generate reliable EEG signals. Leveraging the EEGdenoiseNet dataset, we created synthetic datasets with controlled EMG noise levels for model training and testing across a signal-to-noise ratio (SNR) range from -7 to 2. Our evaluation metrics included RRMSE and Pearson's CC, assessing both time and frequency domains, and we compared our model with others. The pix2pixGAN model excelled, especially under high noise conditions, achieving significantly lower RRMSE and higher CC values. This demonstrates the model's superior accuracy and stability in purifying EEG signals, offering a robust solution for EEG analysis challenges and advancing clinical and neuroscience applications.

Submitted 20 November, 2024; originally announced November 2024.

Comments: 17 pages, 6 figures

MSC Class: I.4.9

arXiv:2411.08742 [pdf, other]

A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models

Authors: Dingdong Wang, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen Meng

Abstract: With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate with text-based tokens seamlessly. Most studies focus on continuous speech features, and although discrete-token based LLMs have shown promising results on certain tasks, the performance gap between these two paradigms is rarely explored. In this paper, we present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM (Qwen1.5-0.5B). Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding. Moreover, this study goes beyond surface-level comparison by identifying key factors behind the under-performance of discrete tokens, such as limited token granularity and inefficient information retention. To enhance the performance of discrete tokens, we explore potential aspects based on our analysis. We hope our results can offer new insights into the opportunities for advancing discrete speech tokens in Speech LLMs.

Submitted 13 November, 2024; originally announced November 2024.

Comments: 5 tables, 4 figures

arXiv:2411.07486 [pdf, other]

Reference Signal-Based Waveform Design for Integrated Sensing and Communications System

Authors: Ming Lyu, Hao Chen, Dan Wang, Guangyin Feng, Chen Qiu, Xiaodong Xu

Abstract: Integrated sensing and communications (ISAC), as one of the key technologies, is capable of supporting high-speed communication and high-precision sensing for the upcoming 6G. This paper studies a waveform strategy by designing the orthogonal frequency division multiplexing (OFDM)-based reference signal (RS) for sensing and communication in an ISAC system. We derive the closed-form expressions of the Cramér-Rao Bound (CRB) for the distance and velocity estimations, and obtain the communication rate under the mean square error of channel estimation. Then, a weighted sum CRB minimization problem on the distance and velocity estimations is formulated by considering the communication rate requirement and RS interval constraints, which is a mixed-integer problem due to the discrete RS interval values. To solve this problem, numerical methods are typically adopted to obtain the optimal solutions, whose computational complexity grows exponentially with the number of symbols and subcarriers of OFDM. Therefore, we propose a relaxation and approximation method to transform the original discrete problem into a continuous convex one and obtain the sub-optimal solutions. Finally, our proposed scheme is compared with the exhaustive search method in numerical simulations, which show a slight gap between the obtained sub-optimal and optimal solutions, and this gap further decreases with a large weight factor.

Submitted 11 November, 2024; originally announced November 2024.

Comments: 6 pages, 4 figures

arXiv:2411.07387 [pdf, other]

Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

Authors: Midia Yousefi, Yao Qian, Junkun Chen, Gang Wang, Yanqing Liu, Dongmei Wang, Xiaofei Wang, Jian Xue

Abstract: End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio, including both speech and pause segments. Previous methods often controlled the number of words or characters generated by the Machine Translation model to approximate the source sentence's length without considering the isochrony of pauses and speech segments, as duration can vary between languages. To address this, we present improvements to the duration alignment component of our sequence-to-sequence ST model. Our method controls translation length by predicting the duration of speech and pauses in conjunction with the translation process. This is achieved by providing timing information to the decoder, ensuring it tracks the remaining duration for speech and pauses while generating the translation. The evaluation on the Zh-En test set of CoVoST 2 demonstrates that the proposed Isochrony-Controlled ST achieves 0.92 speech overlap and 8.9 BLEU, which is only a 1.4 BLEU drop compared to the ST baseline.

Submitted 11 November, 2024; originally announced November 2024.

arXiv:2411.07001 [pdf, other]

DoF Analysis and Beamforming Design for Active IRS-aided Multi-user MIMO Wireless Communication in Rank-deficient Channels

Authors: Feng Shu, Jinbing Jiang, Xuehui Wang, Ke Yang, Chong Shen, Qi Zhang, Dongming Wang, Jiangzhou Wang

Abstract: Due to its ability to significantly improve data rate, the intelligent reflecting surface (IRS) is a potentially crucial technique for future-generation wireless networks like 6G. In this paper, we focus on the analysis of degree of freedom (DoF) in the IRS-aided multi-user MIMO network. Firstly, the DoF upper bound of the IRS-aided single-user MIMO network, i.e., the achievable maximum DoF of such a system, is derived, and the corresponding results are extended to the case of IRS-aided multi-user MIMO by using the matrix rank inequalities. In particular, in seriously rank-deficient, also called low-rank, channels like line-of-sight (LoS), the network DoF may double over no-IRS with the help of IRS. To verify the rate performance gain from augmented DoF, three closed-form beamforming methods, null-space projection plus maximize transmit power and maximize receive power (NSP-MTP-MRP), Schmidt orthogonalization plus minimum mean square error (SO-MMSE) and two-layer leakage plus MMSE (TLL-MMSE), are proposed to achieve the maximum DoF. Simulation results show that IRS does make a dramatic rate enhancement. For example, in a seriously deficient channel, the sum-rate of the proposed TLL-MMSE aided by IRS is about twice that of no IRS. This means that IRS may achieve a significant DoF improvement in such a channel.

Submitted 13 November, 2024; v1 submitted 11 November, 2024; originally announced November 2024.

Comments: 12 pages, 9 figures

arXiv:2411.05305 [pdf, other]

Hybrid Precoding with Per-Beam Timing Advance for Asynchronous Cell-free mmWave Massive MIMO-OFDM Systems

Authors: Pengzhe Xin, Yang Cao, Yue Wu, Dongming Wang, Xiaohu You, Jiangzhou Wang

Abstract: Cell-free massive multiple-input-multiple-output (CF-mMIMO) is regarded as one of the promising technologies for next-generation wireless networks. However, due to its distributed architecture, in which geographically separated access points (APs) jointly serve a large number of user-equipments (UEs), there will inevitably be discrepancies in the arrival time of transmitted signals. In this paper, we investigate millimeter-wave (mmWave) CF-mMIMO orthogonal frequency division multiplexing (OFDM) systems with asynchronous reception in a wide area coverage scenario, where asynchronous timing offsets may extend far beyond the cyclic prefix (CP) range. A comprehensive asynchronous beam-domain signal transmission model is presented for mmWave CF-mMIMO-OFDM systems in both downlink and uplink, incorporating phase offset, inter-carrier interference (ICI) and inter-symbol interference (ISI). To address the issue of asynchronous reception, we propose a novel per-beam timing advance (PBTA) hybrid precoding architecture and analyze the spectral efficiency (SE) in the beam domain for downlink and uplink asynchronous receptions. Both scalable centralized and distributed implementations are taken into account, and the asynchronous delay phase is utilized to design precoding/combining vectors. Furthermore, we formulate the sum rate maximization problem and develop two low-complexity joint beam selection and UE association algorithms considering the impact of asynchronous timing offset exceeding the CP range.
Simulation results demonstrate that the performance will be severely limited by ICI and ISI, and our proposed PBTA hybrid precoding architecture effectively mitigates asynchronous interference compared to the nearest AAU/UE-based timing-advance scheme. Additionally, numerical results show that our proposed low-complexity joint beam selection and UE association algorithms achieve superior SE performance. △ Less

Submitted 7 November, 2024; originally announced November 2024.

arXiv:2411.03723 [pdf]

Zero-shot Dynamic MRI Reconstruction with Global-to-local Diffusion Model

Authors: Yu Guan, Kunlong Zhang, Qi Qi, Dong Wang, Ziwen Ke, Shaoyu Wang, Dong Liang, Qiegen Liu

Abstract: Diffusion models have recently demonstrated considerable advancement in the generation and reconstruction of magnetic resonance imaging (MRI) data. These models exhibit great potential in handling unsampled data and reducing noise, highlighting their promise as generative models. However, their application in dynamic MRI remains relatively underexplored. This is primarily due to the substantial amount of fully-sampled data typically required for training, which is difficult to obtain in dynamic MRI due to its spatio-temporal complexity and high acquisition costs. To address this challenge, we propose a dynamic MRI reconstruction method based on a time-interleaved acquisition scheme, termed the Global-to-local Diffusion Model. Specifically, fully encoded full-resolution reference data are constructed by merging under-sampled k-space data from adjacent time frames, generating two distinct bulk training datasets for global and local models. The global-to-local diffusion framework alternately optimizes global information and local image details, enabling zero-shot reconstruction. Extensive experiments demonstrate that the proposed method performs well in terms of noise reduction and detail preservation, achieving reconstruction quality comparable to that of supervised approaches.

Submitted 6 November, 2024; originally announced November 2024.

Comments: 11 pages, 9 figures

arXiv:2410.22362 [pdf, other]

MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation

Authors: Jialin Luo, Yuanzhi Wang, Ziqi Gu, Yide Qiu, Shuaizhen Yao, Fuyun Wang, Chunyan Xu, Wenhua Zhang, Dan Wang, Zhen Cui

Abstract: Recently, the diffusion-based generative paradigm has achieved impressive general image generation capabilities with text prompts due to its accurate distribution modeling and stable training process. However, generating diverse remote sensing (RS) images that are tremendously different from general images in terms of scale and perspective remains a formidable challenge due to the lack of a comprehensive remote sensing image generation dataset with various modalities, ground sample distances (GSD), and scenes. In this paper, we propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios. Specifically, we first collect nine publicly available RS datasets and conduct standardization for all samples. To bridge RS images to textual semantic information, we utilize a large-scale pretrained vision-language model to automatically output text prompts and perform hand-crafted rectification, resulting in information-rich text-image pairs (including multi-modal images). In particular, we design some methods to obtain the images with different GSD and various environments (e.g., low-light, foggy) in a single sample. With extensive manual screening and refining annotations, we ultimately obtain an MMM-RS dataset that comprises approximately 2.1 million text-image pairs. Extensive experimental results verify that our proposed MMM-RS dataset allows off-the-shelf diffusion models to generate diverse RS images across various modalities, scenes, weather conditions, and GSD.
The dataset is available at https://github.com/ljl5261/MMM-RS.

Submitted 26 October, 2024; originally announced October 2024.

Comments: Accepted by NeurIPS 2024

Showing 1–50 of 379 results for author: Wang, D