Showing 1–50 of 379 results for author: Wang, D

Searching in archive eess.
  1. arXiv:2507.01888  [pdf]

    eess.AS

    Perceptual Ratings Predict Speech Inversion Articulatory Kinematics in Childhood Speech Sound Disorders

    Authors: Nina R. Benway, Saba Tabatabaee, Dongliang Wang, Benjamin Munson, Jonathan L. Preston, Carol Espy-Wilson

    Abstract: Purpose: This study evaluated whether articulatory kinematics, inferred by Articulatory Phonology speech inversion neural networks, aligned with perceptual ratings of /r/ and /s/ in the speech of children with speech sound disorders. Methods: Articulatory Phonology vocal tract variables were inferred for 5,961 utterances from 118 children and 3 adults, aged 2.25-45 years. Perceptual ratings were…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: This manuscript is in submission for publication. It has not yet been peer reviewed

  2. arXiv:2506.15125  [pdf, ps, other]

    eess.SP

    Fiber Signal Denoising Algorithm using Hybrid Deep Learning Networks

    Authors: Linlin Wang, Wei Wang, Dezhao Wang, Shanwen Wang

    Abstract: With the applicability of optical fiber-based distributed acoustic sensing (DAS) systems, effective signal processing and analysis approaches are needed to promote its popularization in the field of intelligent transportation systems (ITS). This paper presents a signal denoising algorithm using a hybrid deep-learning network (HDLNet). Without annotated data and time-consuming labeling, this self-s…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: 15 pages, 10 figures

  3. arXiv:2506.11540  [pdf, ps, other]

    eess.SP

    MMWiLoc: A Multi-Sensor Dataset and Robust Device-Free Localization Method Using Commercial Off-The-Shelf Millimeter Wave Wi-Fi Devices

    Authors: Wenbo Ding, Yang Li, Dongsheng Wang, Bin Zhao, Yunrong Zhu, Yibo Zhang, Yumeng Miao

    Abstract: Device-free Wi-Fi sensing has numerous benefits in practical settings, as it eliminates the requirement for dedicated sensing devices and can be accomplished using current low-cost Wi-Fi devices. With the development of Wi-Fi standards, millimeter wave Wi-Fi devices with 60GHz operating frequency and up to 4GHz bandwidth have become commercially available. Although millimeter wave Wi-Fi presents g…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 8 pages, 8 figures

  4. arXiv:2506.09512  [pdf, ps, other]

    eess.SY cs.LG

    A Survey on the Role of Artificial Intelligence and Machine Learning in 6G-V2X Applications

    Authors: Donglin Wang, Anjie Qiu, Qiuheng Zhou, Hans D. Schotten

    Abstract: The rapid advancement of Vehicle-to-Everything (V2X) communication is transforming Intelligent Transportation Systems (ITS), with 6G networks expected to provide ultra-reliable, low-latency, and high-capacity connectivity for Connected and Autonomous Vehicles (CAVs). Artificial Intelligence (AI) and Machine Learning (ML) have emerged as key enablers in optimizing V2X communication by enhancing net…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 7 pages, 1 figure

  5. arXiv:2506.08038  [pdf, ps, other]

    eess.SY cs.MA

    Joint Routing and Control Optimization in VANET

    Authors: Chen Huang, Dingxuan Wang, Ronghui Hou

    Abstract: In this paper, we introduce DynaRoute, an adaptive joint optimization framework for dynamic vehicular networks that simultaneously addresses platoon control and data transmission through trajectory-aware routing and safety-constrained vehicle coordination. DynaRoute guarantees continuous vehicle movement via platoon safety control with optimizing transmission paths through real-time trajectory pre…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: 11 pages; 10 figures

  6. arXiv:2506.04779  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

    Authors: Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng

    Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent mu…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench

  7. arXiv:2506.02012  [pdf, other]

    cs.CV cs.SD eess.AS

    Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

    Authors: Zehua Liu, Xiaolou Li, Li Guo, Lantian Li, Dong Wang

    Abstract: Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to effectively utilize LLMs in VSR tasks remains unexplored. This paper systematically explores how to better leverage LLMs f…

    Submitted 27 May, 2025; originally announced June 2025.

  8. arXiv:2506.02010  [pdf, other]

    cs.CV cs.SD eess.AS

    CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge

    Authors: Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang

    Abstract: This paper presents the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), which builds on CNVSRC 2023 to advance research in Chinese Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR). The challenge evaluates two test scenarios: reading in recording studios and Internet speech. CNVSRC 2024 uses the same datasets as its predecessor CNVSRC 2023, which involves…

    Submitted 27 May, 2025; originally announced June 2025.

    Comments: to be published in INTERSPEECH 2025

  9. arXiv:2506.00885  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

    Authors: Leying Zhang, Yao Qian, Xiaofei Wang, Manthan Thakker, Dongmei Wang, Jianwei Yu, Haibin Wu, Yuxuan Hu, Jinyu Li, Yanmin Qian, Sheng Zhao

    Abstract: Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-tal…

    Submitted 1 June, 2025; originally announced June 2025.

  10. arXiv:2505.21805  [pdf, ps, other]

    cs.SD eess.AS

    An Investigation on Speaker Augmentation for End-to-End Speaker Extraction

    Authors: Zhenghai You, Zhenyu Zhou, Lantian Li, Dong Wang

    Abstract: Target confusion, defined as occasional switching to non-target speakers, poses a key challenge for end-to-end speaker extraction (E2E-SE) systems. We argue that this problem is largely caused by the lack of generalizability and discrimination of the speaker embeddings, and introduce a simple yet effective speaker augmentation strategy to tackle the problem. Specifically, we propose a time-domain…

    Submitted 27 May, 2025; originally announced May 2025.

  11. arXiv:2505.18533  [pdf, ps, other]

    eess.AS cs.AI

    TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network

    Authors: Xiaobin Rong, Dahan Wang, Qinwen Hu, Yushi Wang, Yuxiang Hu, Jing Lu

    Abstract: Universal speech enhancement aims to handle input speech with different distortions and input formats. To tackle this challenge, we present TS-URGENet, a Three-Stage Universal, Robust, and Generalizable speech Enhancement Network. To address various distortions, the proposed system employs a novel three-stage architecture consisting of a filling stage, a separation stage, and a restoration stage.…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  12. arXiv:2505.02439  [pdf, ps, other]

    cs.AI cs.LG eess.SY

    ReeM: Ensemble Building Thermodynamics Model for Efficient HVAC Control via Hierarchical Reinforcement Learning

    Authors: Yang Deng, Yaohui Liu, Rui Liang, Dafang Zhao, Donghua Xie, Ittetsu Taniguchi, Dan Wang

    Abstract: The building thermodynamics model, which predicts real-time indoor temperature changes under potential HVAC (Heating, Ventilation, and Air Conditioning) control operations, is crucial for optimizing HVAC control in buildings. While pioneering studies have attempted to develop such models for various building environments, these models often require extensive data collection periods and rely heavil…

    Submitted 5 May, 2025; originally announced May 2025.

  13. arXiv:2504.17898  [pdf, other]

    eess.SP cs.CV

    Material Identification Via RFID For Smart Shopping

    Authors: David Wang, Derek Goh, Jiale Zhang

    Abstract: Cashierless stores rely on computer vision and RFID tags to associate shoppers with items, but concealed items placed in backpacks, pockets, or bags create challenges for theft prevention. We introduce a system that turns existing RFID tagged items into material sensors by exploiting how different containers attenuate and scatter RF signals. Using RSSI and phase angle, we trained a neural network…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: 5 pages, 7 figures

    ACM Class: J.0; J.7; B.0

  14. arXiv:2504.04969  [pdf, other]

    eess.SP

    Grouped Target Tracking and Seamless People Counting with a 24 GHz MIMO FMCW

    Authors: Dingyang Wang, Sen Yuan, Alexander Yarovoy, Francesco Fioranelli

    Abstract: The problem of radar-based tracking of groups of people moving together and counting their numbers in indoor environments is considered here. A novel processing pipeline to track groups of people moving together and count their numbers is proposed and validated. The pipeline is specifically designed to deal with frequent changes of direction and stop & go movements typical of indoor activities. Th…

    Submitted 7 April, 2025; originally announced April 2025.

  15. arXiv:2504.02402  [pdf, other]

    cs.SD cs.AI eess.AS

    EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling

    Authors: Hao Yin, Shi Guo, Xu Jia, Xudong XU, Lu Zhang, Si Liu, Dong Wang, Huchuan Lu, Tianfan Xue

    Abstract: When sound waves hit an object, they induce vibrations that produce high-frequency and subtle visual changes, which can be used for recovering the sound. Early studies always encounter trade-offs related to sampling rate, bandwidth, field of view, and the simplicity of the optical path. Recent advances in event camera hardware show good potential for its application in visual sound recovery, becau…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: Our project page: https://yyzq1.github.io/EvMic/

  16. arXiv:2504.01519  [pdf, other]

    cs.CL eess.AS

    Chain of Correction for Full-text Speech Recognition with Large Language Models

    Authors: Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang

    Abstract: Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) has gained increased attention due to its potential to correct errors across long contexts and address a broader spectrum of error types, including punctuation restoration and inverse text normalization. Nevertheless, many challenges persist, including issues related to stability, controllability, c…

    Submitted 2 April, 2025; originally announced April 2025.

  17. arXiv:2503.24313  [pdf]

    physics.optics eess.SP

    1-Tb/s/λ Transmission over Record 10714-km AR-HCF

    Authors: Dawei Ge, Siyuan Liu, Qiang Qiu, Peng Li, Qiang Guo, Yiqi Li, Dong Wang, Baoluo Yan, Mingqing Zuo, Lei Zhang, Dechao Zhang, Hu Shi, Jie Luo, Han Li, Zhangyuan Chen

    Abstract: We present the first single-channel 1.001-Tb/s DP-36QAM-PCS recirculating transmission over 73 loops of 146.77-km ultra-low-loss & low-IMI DNANF-5 fiber, achieving a record transmission distance of 10,714.28 km.

    Submitted 2 April, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

  18. arXiv:2503.21491  [pdf, other]

    cs.RO eess.SY

    Data-Driven Contact-Aware Control Method for Real-Time Deformable Tool Manipulation: A Case Study in the Environmental Swabbing

    Authors: Siavash Mahmoudi, Amirreza Davar, Dongyi Wang

    Abstract: Deformable Object Manipulation (DOM) remains a critical challenge in robotics due to the complexities of developing suitable model-based control strategies. Deformable Tool Manipulation (DTM) further complicates this task by introducing additional uncertainties between the robot and its environment. While humans effortlessly manipulate deformable tools using touch and experience, robotic systems s…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Submitted for Journal Review

  19. arXiv:2503.20274  [pdf, other]

    eess.SP

    Near-Field THz Bending Beamforming: A Convex Optimization Perspective

    Authors: Aoran Liu, Weidong Mei, Peilan Wang, Dong Wang, Ya Fei Wu, Zhi Chen, Boyu Ning

    Abstract: Terahertz (THz) communication systems suffer severe blockage issues, which may significantly degrade the communication coverage and quality. Bending beams, capable of adjusting their propagation direction to bypass obstacles, have recently emerged as a promising solution to resolve this issue by engineering the propagation trajectory of the beam. However, traditional bending beam generation method…

    Submitted 7 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

  20. arXiv:2503.18375  [pdf, other]

    cs.LG eess.SP

    ALWNN Empowered Automatic Modulation Classification: Conquering Complexity and Scarce Sample Conditions

    Authors: Yunhao Quan, Chuang Gao, Nan Cheng, Zhijie Zhang, Zhisheng Yin, Wenchao Xu, Danyang Wang

    Abstract: In Automatic Modulation Classification (AMC), deep learning methods have shown remarkable performance, offering significant advantages over traditional approaches and demonstrating their vast potential. Nevertheless, notable drawbacks, particularly in their high demands for storage, computational resources, and large-scale labeled data, which limit their practical application in real-world scenari…

    Submitted 24 March, 2025; originally announced March 2025.

  21. arXiv:2503.17886  [pdf, other]

    cs.SD eess.AS

    Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition

    Authors: Yufeng Yang, Hassan Taherian, Vahid Ahmadi Kalkhorani, DeLiang Wang

    Abstract: Despite the tremendous success of automatic speech recognition (ASR) with the introduction of deep learning, its performance is still unsatisfactory in many real-world multi-talker scenarios. Speaker separation excels in separating individual talkers but, as a frontend, it introduces processing artifacts that degrade the ASR backend trained on clean speech. As a result, mainstream robust ASR syste…

    Submitted 22 March, 2025; originally announced March 2025.

  22. arXiv:2503.14185  [pdf, other]

    cs.CL cs.SD eess.AS

    AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation

    Authors: Wuwei Huang, Dexin Wang, Deyi Xiong

    Abstract: In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static, from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text transla…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: ACL 2021 Findings

  23. arXiv:2503.13478  [pdf]

    eess.SP cs.CR cs.CY

    Advancing Highway Work Zone Safety: A Comprehensive Review of Sensor Technologies for Intrusion and Proximity Hazards

    Authors: Ayenew Yihune Demeke, Moein Younesi Heravi, Israt Sharmin Dola, Youjin Jang, Chau Le, Inbae Jeong, Zhibin Lin, Danling Wang

    Abstract: Highway work zones are critical areas where accidents frequently occur, often due to the proximity of workers to heavy machinery and ongoing traffic. With technological advancements in sensor technologies and the Internet of Things, promising solutions are emerging to address these safety concerns. This paper provides a systematic review of existing studies on the application of sensor technologie…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: 4 Figures, 5 Tables

  24. arXiv:2503.13257  [pdf, other]

    eess.IV

    Anatomically and Metabolically Informed Diffusion for Unified Denoising and Segmentation in Low-Count PET Imaging

    Authors: Menghua Xia, Kuan-Yin Ko, Der-Shiun Wang, Ming-Kai Chen, Qiong Liu, Huidong Xie, Liang Guo, Wei Ji, Jinsong Ouyang, Reimund Bayerlein, Benjamin A. Spencer, Quanzheng Li, Ramsey D. Badawi, Georges El Fakhri, Chi Liu

    Abstract: Positron emission tomography (PET) image denoising, along with lesion and organ segmentation, are critical steps in PET-aided diagnosis. However, existing methods typically treat these tasks independently, overlooking inherent synergies between them as correlated steps in the analysis pipeline. In this work, we present the anatomically and metabolically informed diffusion (AMDiff) model, a unified…

    Submitted 17 March, 2025; originally announced March 2025.

  25. arXiv:2503.12840  [pdf, other]

    cs.SD cs.CV eess.AS

    Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics

    Authors: Chen Liu, Liying Yang, Peike Li, Dadong Wang, Lincheng Li, Xin Yu

    Abstract: Sound-guided object segmentation has drawn considerable attention for its potential to enhance multimodal perception. Previous methods primarily focus on developing advanced architectures to facilitate effective audio-visual interactions, without fully addressing the inherent challenges posed by audio natures, i.e., (1) feature confusion due to the overlapping nature of audio signals, and (2…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025

  26. arXiv:2503.08134  [pdf, other]

    eess.SP

    THz Beam Squint Mitigation via 3D Rotatable Antennas

    Authors: Yike Xie, Weidong Mei, Dong Wang, Boyu Ning, Zhi Chen, Jun Fang, Wei Guo

    Abstract: Analog beamforming holds great potential for future terahertz (THz) communications due to its ability to generate high-gain directional beams with low-cost phase shifters. However, conventional analog beamforming may suffer substantial performance degradation in wideband systems due to the beam-squint effects. Instead of relying on high-cost true time delayers, we propose in this paper an efficient…

    Submitted 11 March, 2025; originally announced March 2025.

  27. arXiv:2503.07997  [pdf, ps, other]

    eess.SP eess.IV eess.SY

    A Survey of Challenges and Sensing Technologies in Autonomous Retail Systems

    Authors: Shimmy Rukundo, David Wang, Front Wongnonthawitthaya, Youssouf Sidibé, Minsik Kim, Emily Su, Jiale Zhang

    Abstract: Autonomous stores leverage advanced sensing technologies to enable cashier-less shopping, real-time inventory tracking, and seamless customer interactions. However, these systems face significant challenges, including occlusion in vision-based tracking, scalability of sensor deployment, theft prevention, and real-time data processing. To address these issues, researchers have explored multi-modal…

    Submitted 10 March, 2025; originally announced March 2025.

    ACM Class: J.0; J.7; A.1

  28. arXiv:2503.02769  [pdf, ps, other]

    cs.SD cs.CL cs.HC eess.AS

    InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training

    Authors: Dingdong Wang, Jin Xu, Ruihang Chu, Zhifang Guo, Xiong Wang, Jincenzi Wu, Dongchao Yang, Shengpeng Ji, Junyang Lin

    Abstract: Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models significantly diminishes when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency b…

    Submitted 4 June, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

    Comments: Accepted to ACL 2025; Data is available at: https://huggingface.co/datasets/ddwang2000/SpeechInstructBench

  29. arXiv:2503.02647  [pdf, other]

    cs.IT eess.SP

    A Framework for Uplink ISAC Receiver Designs: Performance Analysis and Algorithm Development

    Authors: Zhiyuan Yu, Hong Ren, Cunhua Pan, Gui Zhou, Dongming Wang, Chau Yuen, Jiangzhou Wang

    Abstract: Uplink integrated sensing and communication (ISAC) systems have recently emerged as a promising research direction, enabling simultaneous uplink signal detection and target sensing. In this paper, we propose the flexible projection (FP)-type receiver that unify the projection-type receiver and the successive interference cancellation (SIC)-type receiver by using a flexible tradeoff factor to adapt…

    Submitted 3 April, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

    Comments: 13 pages, 9 figures, submitted to an IEEE journal for possible publication

  30. arXiv:2503.00340  [pdf, other]

    eess.AS

    UL-UNAS: Ultra-Lightweight U-Nets for Real-Time Speech Enhancement via Network Architecture Search

    Authors: Xiaobin Rong, Dahan Wang, Yuxiang Hu, Changbao Zhu, Kai Chen, Jing Lu

    Abstract: Lightweight models are essential for real-time speech enhancement applications. In recent years, there has been a growing trend toward developing increasingly compact models for speech enhancement. In this paper, we propose an Ultra-Lightweight U-net optimized by Network Architecture Search (UL-UNAS), which is suitable for implementation in low-footprint devices. Firstly, we explore the applicatio…

    Submitted 28 February, 2025; originally announced March 2025.

    Comments: 13 pages, 8 figures, submitted to Neural Networks

  31. arXiv:2502.14224  [pdf, other]

    eess.AS cs.SD

    Adaptive Convolution for CNN-based Speech Enhancement Models

    Authors: Dahan Wang, Xiaobin Rong, Shiruo Sun, Yuxiang Hu, Changbao Zhu, Jing Lu

    Abstract: Deep learning-based speech enhancement methods have significantly improved speech quality and intelligibility. Convolutional neural networks (CNNs) have been proven to be essential components of many high-performance models. In this paper, we introduce adaptive convolution, an efficient and versatile convolutional module that enhances the model's capability to adaptively represent speech signals.…

    Submitted 19 February, 2025; originally announced February 2025.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  32. arXiv:2502.09631  [pdf, other]

    eess.IV cs.GR

    Volumetric Temporal Texture Synthesis for Smoke Stylization using Neural Cellular Automata

    Authors: Dongqing Wang, Ehsan Pajouheshgar, Yitao Xu, Tong Zhang, Sabine Süsstrunk

    Abstract: Artistic stylization of 3D volumetric smoke data is still a challenge in computer graphics due to the difficulty of ensuring spatiotemporal consistency given a reference style image, and that within reasonable time and computational resources. In this work, we introduce Volumetric Neural Cellular Automata (VNCA), a novel model for efficient volumetric style transfer that synthesizes, in real-time,…

    Submitted 5 February, 2025; originally announced February 2025.

  33. arXiv:2502.00421  [pdf, other]

    cs.CL cs.SD eess.AS

    Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language

    Authors: Turi Abu, Ying Shi, Thomas Fang Zheng, Dong Wang

    Abstract: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and…

    Submitted 1 February, 2025; originally announced February 2025.

    Comments: Accepted for ICASSP2025 (2025 IEEE International Conference on Acoustics, Speech, and Signal Processing)

  34. arXiv:2501.15368  [pdf, other]

    cs.CL cs.SD eess.AS

    Baichuan-Omni-1.5 Technical Report

    Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, et al. (68 additional authors not shown)

    Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pip…

    Submitted 25 January, 2025; originally announced January 2025.

  35. arXiv:2501.14234  [pdf, other]

    eess.SP cs.IT

    STAR-RIS-Enabled Multi-Path Beam Routing with Passive Beam Splitting

    Authors: Bonan An, Weidong Mei, Yuanwei Liu, Dong Wang, Zhi Chen

    Abstract: Reconfigurable intelligent surfaces (RISs) can be densely deployed in the environment to create multi-reflection line-of-sight (LoS) links for signal coverage enhancement. However, conventional reflection-only RISs can only achieve half-space reflection, which limits the LoS path diversity. In contrast, simultaneously transmitting and reflecting RISs (STAR-RISs) can achieve full-space reflection a…

    Submitted 19 May, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

  36. arXiv:2501.13336  [pdf, other]

    cs.CV eess.IV

    Gradient-Free Adversarial Purification with Diffusion Models

    Authors: Xuelong Dai, Dong Wang, Duan Mingxing, Bin Xiao

    Abstract: Adversarial training and adversarial purification are two effective and practical defense methods to enhance a model's robustness against adversarial attacks. However, adversarial training necessitates additional training, while adversarial purification suffers from low time efficiency. More critically, current defenses are designed under the perturbation-based adversarial threat model, which is i…

    Submitted 22 January, 2025; originally announced January 2025.

  37. arXiv:2412.20371  [pdf, other]

    eess.SP

    Cooperative ISAC-empowered Low-Altitude Economy

    Authors: Jun Tang, Yiming Yu, Cunhua Pan, Hong Ren, Dongming Wang, Jiangzhou Wang, Xiaohu You

    Abstract: This paper proposes a cooperative integrated sensing and communication (ISAC) scheme for the low-altitude sensing scenario, aiming at estimating the parameters of the unmanned aerial vehicles (UAVs) and enhancing the sensing performance via cooperation. The proposed scheme consists of two stages. In Stage I, we formulate the monostatic parameter estimation problem via using a tensor decomposition…

    Submitted 29 December, 2024; originally announced December 2024.

  38. arXiv:2412.20349  [pdf, other]

    eess.SP

    Two-Timescale Design for AP Mode Selection of Cooperative ISAC Networks

    Authors: Zhichu Ren, Cunhua Pan, Hong Ren, Dongming Wang, Lexi Xu, Jiangzhou Wang

    Abstract: As an emerging technology, cooperative bi-static integrated sensing and communication (ISAC) is promising to achieve high-precision sensing, high-rate communication as well as self-interference (SI) avoidance. This paper investigates the two-timescale design for access point (AP) mode selection to realize the full potential of the cooperative bi-static ISAC network with low system overhead, where…

    Submitted 28 December, 2024; originally announced December 2024.

    Comments: 13 pages, 8 figures

  39. arXiv:2412.13891  [pdf, ps, other]

    cs.LG eess.SP

    Graph-Driven Models for Gas Mixture Identification and Concentration Estimation on Heterogeneous Sensor Array Signals

    Authors: Ding Wang, Lei Wang, Huilin Yin, Guoqing Gu, Zhiping Lin, Wenwen Zhang

    Abstract: Accurately identifying gas mixtures and estimating their concentrations are crucial across various industrial applications using gas sensor arrays. However, existing models face challenges in generalizing across heterogeneous datasets, which limits their scalability and practical applicability. To address this problem, this study develops two novel deep-learning models that integrate temporal grap…

    Submitted 18 December, 2024; originally announced December 2024.

  40. arXiv:2412.11614  [pdf]

    eess.SP

    Acceleration and Parallelization Methods for ISRS EGN Model

    Authors: Ruiyang Xia, Guanjun Gao, Zanshan Zhao, Haoyu Wang, Kun Wen, Daobin Wang

    Abstract: The enhanced Gaussian noise (EGN) model, which accounts for inter-channel stimulated Raman scattering (ISRS), has been extensively utilized for evaluating nonlinear interference (NLI) within the C+L band. Compared to closed-form expressions and machine learning-based NLI evaluation models, it demonstrates broader applicability and its accuracy is not dependent on the support of large-scale dataset…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: 12 pages, 12 figures, preprint submitted to IEEE for possible publication

  41. arXiv:2412.10489  [pdf, other]

    cs.CV cs.AI eess.SP

    CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

    Authors: Kaifan Zhang, Lihuo He, Xin Jiang, Wen Lu, Di Wang, Xinbo Gao

    Abstract: Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their non-invasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable "beyond-image-modality" information embedded in EEG signals. This results in the loss of cri…

    Submitted 24 December, 2024; v1 submitted 13 December, 2024; originally announced December 2024.

  42. arXiv:2412.09887  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls

    Authors: Li Chai, Donglin Wang

    Abstract: Lyric-to-melody generation is a highly challenging task in the field of AI music generation. Due to the difficulty of learning strict yet weak correlations between lyrics and melodies, previous methods have suffered from weak controllability, low-quality and poorly structured generation. To address these challenges, we propose CSL-L2M, a controllable song-level lyric-to-melody generation method ba…

    Submitted 14 January, 2025; v1 submitted 13 December, 2024; originally announced December 2024.

    Comments: Accepted at AAAI-25

  43. arXiv:2411.13288  [pdf

    eess.SP

    EEG Signal Denoising Using pix2pix GAN: Enhancing Neurological Data Analysis

    Authors: Haoyi Wang, Xufang Chen, Yue Yang, Kewei Zhou, Meining Lv, Dongrui Wang, Wenjie Zhang

    Abstract: Electroencephalography (EEG) is essential in neuroscience and clinical practice, yet it suffers from physiological artifacts, particularly electromyography (EMG), which distort signals. We propose a deep learning model using pix2pixGAN to remove such noise and generate reliable EEG signals. Leveraging the EEGdenoiseNet dataset, we created synthetic datasets with controlled EMG noise levels for mod…

    Submitted 20 November, 2024; originally announced November 2024.

    Comments: 17 pages, 6 figures

    MSC Class: I.4.9

  44. arXiv:2411.08742  [pdf, other

    cs.CL cs.SD eess.AS

    A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models

    Authors: Dingdong Wang, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen Meng

    Abstract: With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate with text-based tokens seamlessly. Compared to most studies that focus on continuous speech features, although discrete-token based LLMs have shown promising results on certain tasks, the performance gap between these two paradigms is rarely explored…

    Submitted 13 November, 2024; originally announced November 2024.

    Comments: 5 tables, 4 figures

  45. arXiv:2411.07486  [pdf, other

    eess.SP

    Reference Signal-Based Waveform Design for Integrated Sensing and Communications System

    Authors: Ming Lyu, Hao Chen, Dan Wang, Guangyin Feng, Chen Qiu, Xiaodong Xu

    Abstract: Integrated sensing and communications (ISAC), as one of the key technologies, is capable of supporting high-speed communication and high-precision sensing for the upcoming 6G. This paper studies a waveform strategy by designing the orthogonal frequency division multiplexing (OFDM)-based reference signal (RS) for sensing and communication in an ISAC system. We derive the closed-form expressions of Cramé…

    Submitted 11 November, 2024; originally announced November 2024.

    Comments: 6 pages, 4 figures

  46. arXiv:2411.07387  [pdf, other

    cs.CL eess.AS

    Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

    Authors: Midia Yousefi, Yao Qian, Junkun Chen, Gang Wang, Yanqing Liu, Dongmei Wang, Xiaofei Wang, Jian Xue

    Abstract: End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio, including both speech and pause segments. Previous methods often controlled the number of words or charac…

    Submitted 11 November, 2024; originally announced November 2024.

  47. arXiv:2411.07001  [pdf, other

    eess.SP

    DoF Analysis and Beamforming Design for Active IRS-aided Multi-user MIMO Wireless Communication in Rank-deficient Channels

    Authors: Feng Shu, Jinbing Jiang, Xuehui Wang, Ke Yang, Chong Shen, Qi Zhang, Dongming Wang, Jiangzhou Wang

    Abstract: Due to its ability to significantly improve data rates, the intelligent reflecting surface (IRS) is a potentially crucial technique for future-generation wireless networks such as 6G. In this paper, we focus on the analysis of the degrees of freedom (DoF) in IRS-aided multi-user MIMO networks. Firstly, the DoF upper bound of the IRS-aided single-user MIMO network, i.e., the achievable maximum DoF of s…

    Submitted 13 November, 2024; v1 submitted 11 November, 2024; originally announced November 2024.

    Comments: 12 pages, 9 figures

  48. arXiv:2411.05305  [pdf, other

    eess.SP

    Hybrid Precoding with Per-Beam Timing Advance for Asynchronous Cell-free mmWave Massive MIMO-OFDM Systems

    Authors: Pengzhe Xin, Yang Cao, Yue Wu, Dongming Wang, Xiaohu You, Jiangzhou Wang

    Abstract: Cell-free massive multiple-input-multiple-output (CF-mMIMO) is regarded as one of the promising technologies for next-generation wireless networks. However, because of its distributed architecture, in which geographically separated access points (APs) jointly serve a large number of user equipments (UEs), there will inevitably be discrepancies in the arrival times of transmitted signals. In this paper, we inv…

    Submitted 7 November, 2024; originally announced November 2024.

  49. arXiv:2411.03723  [pdf

    eess.IV cs.CV

    Zero-shot Dynamic MRI Reconstruction with Global-to-local Diffusion Model

    Authors: Yu Guan, Kunlong Zhang, Qi Qi, Dong Wang, Ziwen Ke, Shaoyu Wang, Dong Liang, Qiegen Liu

    Abstract: Diffusion models have recently demonstrated considerable advancement in the generation and reconstruction of magnetic resonance imaging (MRI) data. These models exhibit great potential in handling unsampled data and reducing noise, highlighting their promise as generative models. However, their application in dynamic MRI remains relatively underexplored. This is primarily due to the substantial am…

    Submitted 6 November, 2024; originally announced November 2024.

    Comments: 11 pages, 9 figures

  50. arXiv:2410.22362  [pdf, other

    eess.IV cs.AI cs.CV

    MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation

    Authors: Jialin Luo, Yuanzhi Wang, Ziqi Gu, Yide Qiu, Shuaizhen Yao, Fuyun Wang, Chunyan Xu, Wenhua Zhang, Dan Wang, Zhen Cui

    Abstract: Recently, the diffusion-based generative paradigm has achieved impressive general image generation capabilities with text prompts due to its accurate distribution modeling and stable training process. However, generating diverse remote sensing (RS) images that are tremendously different from general images in terms of scale and perspective remains a formidable challenge due to the lack of a compre…

    Submitted 26 October, 2024; originally announced October 2024.

    Comments: Accepted by NeurIPS 2024