Showing 1–50 of 460 results for author: Wu, J

Searching in archive eess. Search in all archives.
  1. arXiv:2507.07396  [pdf, ps, other]

    cs.MM cs.LG cs.SD eess.AS

    IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing

    Authors: Zeyang Song, Shimin Zhang, Yuhong Chou, Jibin Wu, Haizhou Li

    Abstract: Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the…

    Submitted 9 July, 2025; originally announced July 2025.

    Comments: Under review at TNNLS

  2. arXiv:2507.02374  [pdf, ps, other]

    eess.SP

    Predictive Control over LAWN: Joint Trajectory Design and Resource Allocation

    Authors: Haijia Jin, Jun Wu, Weijie Yuan, Ruizhi Ruan, Jiacheng Wang, Dusit Niyato, Dong In Kim, Abbas Jamalipour

    Abstract: Low-altitude wireless networks (LAWNs) have been envisioned as flexible and transformative platforms for enabling delay-sensitive control applications in Internet of Things (IoT) systems. In this work, we investigate the real-time wireless control over a LAWN system, where an aerial drone is employed to serve multiple mobile automated guided vehicles (AGVs) via finite blocklength (FBL) transmissio…

    Submitted 3 July, 2025; originally announced July 2025.

  3. arXiv:2507.01427  [pdf, ps, other]

    eess.SP

    SDR-Empowered Environment Sensing Design and Experimental Validation Using OTFS-ISAC Signals

    Authors: Jun Wu, Yuye Shi, Weijie Yuan, Qingqing Cheng, Buyi Li, Xinyuan Wei

    Abstract: This paper investigates the system design and experimental validation of integrated sensing and communication (ISAC) for environmental sensing, which is expected to be a critical enabler for next-generation wireless networks. We advocate exploiting orthogonal time frequency space (OTFS) modulation for its inherent sparsity and stability in delay-Doppler (DD) domain channels, facilitating a low-ove…

    Submitted 2 July, 2025; originally announced July 2025.

  4. arXiv:2507.01348  [pdf, ps, other]

    eess.AS cs.SD

    SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech

    Authors: Zhuangfei Cheng, Guangyan Zhang, Zehai Tu, Yangyang Song, Shuiyang Mao, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Jiasong Wu

    Abstract: Foreign accent conversion (FAC) in speech processing remains a challenging task. Building on the remarkable success of large language models (LLMs) in Text-to-Speech (TTS) tasks, this study investigates the adaptation of LLM-based techniques for FAC, which we term SpeechAccentLLM. At the core of this framework, we introduce SpeechCodeVAE, the first model to integrate connectionist temporal classif…

    Submitted 8 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

    Comments: 10 pages, includes references, 4 figures, 4 tables

    ACM Class: I.2.7

  5. arXiv:2506.23075  [pdf, ps, other]

    cs.HC cs.LG eess.SP q-bio.NC

    CSBrain: A Cross-scale Spatiotemporal Brain Foundation Model for EEG Decoding

    Authors: Yuchen Zhou, Jiamin Wu, Zichen Ren, Zhouheng Yao, Weiheng Lu, Kunyu Peng, Qihao Zheng, Chunfeng Song, Wanli Ouyang, Chao Gou

    Abstract: Understanding and decoding brain activity from electroencephalography (EEG) signals is a fundamental challenge in neuroscience and AI, with applications in cognition, emotion recognition, diagnosis, and brain-computer interfaces. While recent EEG foundation models advance generalized decoding via unified architectures and large-scale pretraining, they adopt a scale-agnostic dense modeling paradigm…

    Submitted 28 June, 2025; originally announced June 2025.

  6. arXiv:2506.20970  [pdf, ps, other]

    eess.SP

    Co-Design of Sensing, Communications, and Control for Low-Altitude Wireless Networks

    Authors: Haijia Jin, Jun Wu, Weijie Yuan, Fan Liu, Yuanhao Cui

    Abstract: The rapid advancement of Internet of Things (IoT) services and the evolution toward the sixth generation (6G) have positioned unmanned aerial vehicles (UAVs) as critical enablers of low-altitude wireless networks (LAWNs). This work investigates the co-design of integrated sensing, communication, and control ($\mathbf{SC^{2}}$) for multi-UAV cooperative systems with finite blocklength (FBL) transmi…

    Submitted 25 June, 2025; originally announced June 2025.

  7. arXiv:2506.17887  [pdf, ps, other]

    eess.SP

    Near-Field Propagation and Spatial Non-Stationarity Channel Model for 6-24 GHz (FR3) Extremely Large-Scale MIMO: Adopted by 3GPP for 6G

    Authors: Huixin Xu, Jianhua Zhang, Pan Tang, Hongbo Xing, Haiyang Miao, Nan Zhang, Jian Li, Jianming Wu, Wenfei Yang, Zhening Zhang, Wei Jiang, Zijian He, Afshin Haghighat, Qixing Wang, Guangyi Liu

    Abstract: Next generation cellular deployments are expected to exploit the 6-24 GHz frequency range 3 (FR3) and extremely large-scale multiple-input multiple-output (XL-MIMO) to enable ultra-high data rates and reliability. However, the significantly enlarged antenna apertures and higher carrier frequencies render the far-field and spatial stationarity assumptions in the existing 3rd generation partnership…

    Submitted 21 June, 2025; originally announced June 2025.

  8. arXiv:2506.12475  [pdf, ps, other]

    eess.IV cs.CV

    Efficient Star Distillation Attention Network for Lightweight Image Super-Resolution

    Authors: Fangwei Hao, Ji Du, Desheng Kong, Jiesheng Wu, Jing Xu, Ping Li

    Abstract: In recent years, the performance of lightweight Single-Image Super-Resolution (SISR) has been improved significantly with the application of Convolutional Neural Networks (CNNs) and Large Kernel Attention (LKA). However, existing information distillation modules for lightweight SISR struggle to map inputs into High-Dimensional Non-Linear (HDNL) feature spaces, limiting their representation learnin…

    Submitted 14 June, 2025; originally announced June 2025.

  9. arXiv:2506.11297  [pdf, ps, other]

    eess.IV cs.LG

    Score-based Generative Diffusion Models to Synthesize Full-dose FDG Brain PET from MRI in Epilepsy Patients

    Authors: Jiaqi Wu, Jiahong Ouyang, Farshad Moradi, Mohammad Mehdi Khalighi, Greg Zaharchuk

    Abstract: Fluorodeoxyglucose (FDG) PET to evaluate patients with epilepsy is one of the most common applications for simultaneous PET/MRI, given the need to image both brain structure and metabolism, but is suboptimal due to the radiation dose in this young population. Little work has been done synthesizing diagnostic quality PET images from MRI data or MRI data with ultralow-dose PET using advanced generat…

    Submitted 29 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

  10. arXiv:2506.09650  [pdf, ps, other]

    cs.CV cs.LG cs.MM cs.RO eess.IV

    HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

    Authors: Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen

    Abstract: Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person set…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: The code is available at https://github.com/KPeng9510/HopaDIFF.git

  11. arXiv:2506.08967  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

    Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, et al. (51 additional authors not shown)

    Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du…

    Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: 12 pages, 3 figures

  12. arXiv:2506.04779  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

    Authors: Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng

    Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent mu…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench

  13. arXiv:2506.00385  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

    Authors: Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

    Abstract: Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottlenec…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: 18 pages, 3 figures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec

  14. arXiv:2505.24160  [pdf, ps, other]

    eess.IV cs.CV

    Beyond the LUMIR challenge: The pathway to foundational registration models

    Authors: Junyu Chen, Shuwen Wei, Joel Honkamaa, Pekka Marttinen, Hang Zhang, Min Liu, Yichao Zhou, Zuopeng Tan, Zhuoyuan Wang, Yi Wang, Hongchao Zhou, Shunbo Hu, Yi Zhang, Qian Tao, Lukas Förner, Thomas Wendler, Bailiang Jian, Benedikt Wiestler, Tim Hable, Jin Kim, Dan Ruan, Frederic Madesta, Thilo Sentker, Wiebke Heyer, Lianrui Zuo, et al. (11 additional authors not shown)

    Abstract: Medical image challenges have played a transformative role in advancing the field, catalyzing algorithmic innovation and establishing new performance standards across diverse clinical applications. Image registration, a foundational task in neuroimaging pipelines, has similarly benefited from the Learn2Reg initiative. Building on this foundation, we introduce the Large-scale Unsupervised Brain MRI…

    Submitted 29 May, 2025; originally announced May 2025.

  15. arXiv:2505.19294  [pdf, other]

    cs.SD cs.CL cs.HC cs.MM eess.AS

    Towards Reliable Large Audio Language Model

    Authors: Ziyang Ma, Xiquan Li, Yakun Song, Wenxi Chen, Chenpeng Du, Jian Wu, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

    Abstract: Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and to proactively refuse to answer questions they don't know. While there have been successful attempts to enhance…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: ACL 2025 Findings

  16. arXiv:2505.17472  [pdf, ps, other]

    eess.IV cs.CV

    SUFFICIENT: A scan-specific unsupervised deep learning framework for high-resolution 3D isotropic fetal brain MRI reconstruction

    Authors: Jiangjie Wu, Lixuan Chen, Zhenghao Li, Xin Li, Saban Ozturk, Lihui Wang, Rongpin Wang, Hongjiang Wei, Yuyao Zhang

    Abstract: High-quality 3D fetal brain MRI reconstruction from motion-corrupted 2D slices is crucial for clinical diagnosis. Reliable slice-to-volume registration (SVR)-based motion correction and super-resolution reconstruction (SRR) methods are essential. Deep learning (DL) has demonstrated potential in enhancing SVR and SRR when compared to conventional methods. However, it requires large-scale external t…

    Submitted 25 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

  17. arXiv:2505.17426  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information

    Authors: Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang

    Abstract: The emergence of multi-codebook neural audio codecs such as Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ) has significantly advanced Large-Language-Model (LLM) based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be…

    Submitted 22 May, 2025; originally announced May 2025.

  18. arXiv:2505.13070  [pdf, ps, other]

    eess.SY

    RSS-Based Localization: Ensuring Consistency and Asymptotic Efficiency

    Authors: Shenghua Hu, Guangyang Zeng, Wenchao Xue, Haitao Fang, Junfeng Wu, Biqiang Mu

    Abstract: We study the problem of signal source localization using received signal strength measurements. We begin by presenting verifiable geometric conditions for sensor deployment that ensure the model's asymptotic localizability. Then we establish the consistency and asymptotic efficiency of the maximum likelihood (ML) estimator. However, computing the ML estimator is challenging due to its reliance on…

    Submitted 19 May, 2025; originally announced May 2025.

  19. arXiv:2505.12332  [pdf, other]

    cs.SD cs.AI cs.CV cs.MM eess.AS

    VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

    Authors: Qianyue Hu, Junyan Wu, Wei Lu, Xiangyang Luo

    Abstract: Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi…

    Submitted 20 May, 2025; v1 submitted 18 May, 2025; originally announced May 2025.

  20. arXiv:2505.08581  [pdf, other]

    cs.CV eess.IV q-bio.TO

    ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking

    Authors: Haofeng Liu, Mingqi Gao, Xuxiao Luo, Ziyue Wang, Guanyi Qin, Junde Wu, Yueming Jin

    Abstract: Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation is emerging, given its advantage of providing surgeons with an interactive experience to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicabil…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Early accepted by MICCAI 2025

  21. arXiv:2505.06803  [pdf, other]

    cs.SD cs.CL cs.CV cs.MM eess.AS

    Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation

    Authors: Xilin Jiang, Junkai Wu, Vishal Choudhari, Nima Mesgarani

    Abstract: Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, a…

    Submitted 10 May, 2025; originally announced May 2025.

  22. arXiv:2505.05477  [pdf]

    eess.SP cs.CV

    ECGDeDRDNet: A deep learning-based method for Electrocardiogram noise removal using a double recurrent dense network

    Authors: Sainan Xiao, Wangdong Yang, Buwen Cao, Jintao Wu

    Abstract: Electrocardiogram (ECG) signals are frequently corrupted by noise, such as baseline wander (BW), muscle artifacts (MA), and electrode motion (EM), which significantly degrade their diagnostic utility. To address this issue, we propose ECGDeDRDNet, a deep learning-based ECG Denoising framework leveraging a Double Recurrent Dense Network architecture. In contrast to traditional approaches, we introd…

    Submitted 22 April, 2025; originally announced May 2025.

  23. arXiv:2505.01880  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network

    Authors: Junyan Wu, Wenbo Xu, Wei Lu, Xiangyang Luo, Rui Yang, Shize Guo

    Abstract: Audio temporal forgery localization (ATFL) aims to find the precise forgery regions of the partial spoof audio that is purposefully modified. Existing ATFL methods rely on training efficient networks using fine-grained annotations, which are costly and challenging to obtain in real-world scenarios. To meet this challenge, in this paper, we propose a progressive audio-language co-learning network (L…

    Submitted 7 May, 2025; v1 submitted 3 May, 2025; originally announced May 2025.

    Comments: 9 pages, 5 figures. This paper has been accepted for IJCAI 2025

  24. arXiv:2504.13131  [pdf, other]

    eess.IV cs.AI cs.CV

    NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

    Authors: Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Yabin Zhang, Ao-Xiang Zhang, Tianwu Zhi, Jianzhao Liu, Yang Li, Jingwen Xu, Yiting Liao, Yushen Zuo, Mingyang Wu, Renjie Li, Shengyun Zhong, et al. (88 additional authors not shown)

    Abstract: This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating re…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Challenge Report of NTIRE 2025; Methods from 18 Teams; Accepted by CVPR Workshop; 21 pages

  25. arXiv:2504.11966  [pdf, other]

    cs.CV cs.LG cs.RO eess.IV

    Exploring Video-Based Driver Activity Recognition under Noisy Labels

    Authors: Linjuan Fan, Di Wen, Kunyu Peng, Kailun Yang, Jiaming Zhang, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiamin Wu, Xudong Han, Rainer Stiefelhagen

    Abstract: As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the d…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: The source code is available at https://github.com/ilonafan/DAR-noisy-labels

  26. arXiv:2504.10686  [pdf, other]

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  27. arXiv:2504.08743  [pdf, other]

    cs.IR cs.LG eess.SY math.OC stat.AP

    Dynamic Topic Analysis in Academic Journals using Convex Non-negative Matrix Factorization Method

    Authors: Yang Yang, Tong Zhang, Jian Wu, Lijie Su

    Abstract: With the rapid advancement of large language models, academic topic identification and topic evolution analysis are crucial for enhancing AI's understanding capabilities. Dynamic topic analysis provides a powerful approach to capturing and understanding the temporal evolution of topics in large-scale datasets. This paper presents a two-stage dynamic topic analysis framework that incorporates conve…

    Submitted 23 March, 2025; originally announced April 2025.

    Comments: 11 pages, 7 figures, 6 tables

  28. arXiv:2504.07442  [pdf, other]

    eess.SP

    RIS-Aided Integrated Sensing and Communication Waveform Design With Tunable PAPR

    Authors: Jinlong Wu, Lixin Li, Wensheng Lin, Wei Liang, Decan Zhao, Zhu Han

    Abstract: Low peak-to-average power ratio (PAPR) transmission is an important and favorable requirement prevalent in radar and communication systems, especially in transmission links integrated with high power amplifiers. Meanwhile, motivated by the advantages of reconfigurable intelligent surface (RIS) in mitigating multi-user interference (MUI) to enhance the communication rate, this paper investigates th…

    Submitted 10 April, 2025; originally announced April 2025.

  29. arXiv:2504.07406  [pdf, other]

    cs.SD eess.AS

    Towards Generalizability to Tone and Content Variations in the Transcription of Amplifier Rendered Electric Guitar Audio

    Authors: Yu-Hua Chen, Yuan-Chiao Cheng, Yen-Tung Yeh, Jui-Te Wu, Jyh-Shing Roger Jang, Yi-Hsuan Yang

    Abstract: Transcribing electric guitar recordings is challenging due to the scarcity of diverse datasets and the complex tone-related variations introduced by amplifiers, cabinets, and effect pedals. To address these issues, we introduce EGDB-PG, a novel dataset designed to capture a wide range of tone-related characteristics across various amplifier-cabinet configurations. In addition, we propose the Tone-…

    Submitted 9 April, 2025; originally announced April 2025.

  30. arXiv:2503.19611  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation

    Authors: Max W. Y. Lam, Yijin Xing, Weiya You, Jingcheng Wu, Zongyu Yin, Fuqiang Jiang, Hangyu Liu, Feng Liu, Xingda Li, Wei-Tsung Lu, Hanyu Chen, Tong Feng, Tianwei Zhao, Chien-Hung Liu, Xuchen Song, Yang Li, Yahui Zhou

    Abstract: Autoregressive (AR) models have demonstrated impressive capabilities in generating high-fidelity music. However, the conventional next-token prediction paradigm in AR models does not align with the human creative process in music composition, potentially compromising the musicality of generated samples. To overcome this limitation, we introduce MusiCoT, a novel chain-of-thought (CoT) prompting tec…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Preprint

  31. arXiv:2503.13011  [pdf, other]

    cs.RO eess.IV

    Sensorless Remote Center of Motion Misalignment Estimation

    Authors: Hao Yang, Lidia Al-Zogbi, Ahmet Yildiz, Nabil Simaan, Jie Ying Wu

    Abstract: Laparoscopic surgery constrains instrument motion around a fixed pivot point at the incision into a patient to minimize tissue trauma. Surgical robots achieve this through either hardware or software-based remote center of motion (RCM) constraints. However, accurate RCM alignment is difficult due to manual trocar placement, patient motion, and tissue deformation. Misalignment between the robot's R…

    Submitted 17 March, 2025; originally announced March 2025.

  32. arXiv:2503.12876  [pdf, other]

    cs.RO eess.SY

    A Hierarchical Region-Based Approach for Efficient Multi-Robot Exploration

    Authors: Di Meng, Tianhao Zhao, Chaoyu Xue, Jun Wu, Qiuguo Zhu

    Abstract: Multi-robot autonomous exploration in an unknown environment is an important application in robotics. Traditional exploration methods only use information around frontier points or viewpoints, ignoring spatial information of unknown areas. Moreover, finding the exact optimal solution for multi-robot task allocation is NP-hard, resulting in significant computational time consumption. To address thes…

    Submitted 17 March, 2025; originally announced March 2025.

  33. arXiv:2503.12482  [pdf, other]

    eess.SP

    Fuzzy Clustering for Low-Complexity Time Domain Chromatic Dispersion Compensation Scheme in Coherent Optical Fiber Communication Systems

    Authors: Wenkai Wan, Aiying Yang, Peng Guo, Zhe Zhao, Tianjia Xu, Jinxuan Wu, Zhiheng Liu

    Abstract: Chromatic dispersion compensation (CDC), implemented in either the time-domain or frequency-domain, is crucial for enhancing power efficiency in the digital signal processing of modern optical fiber communication systems. Developing low-complexity CDC schemes is essential for hardware implementation, particularly for high-speed and long-haul optical fiber communication systems. In this work, we prop…

    Submitted 16 March, 2025; originally announced March 2025.

  34. A Unified Approach to Enforce Non-Negativity Constraint in Neural Network Approximation for Optimal Voltage Regulation (preprint)

    Authors: Jiaqi Wu, Jingyi Yuan, Yang Weng, Guangwen Wang

    Abstract: Power system voltage regulation is crucial to maintain power quality while integrating intermittent renewable resources in distribution grids. However, the system model on the grid edge is often unknown, making it difficult to model physical equations for optimal control. Therefore, previous work proposes structured data-driven methods like input convex neural networks (ICNN) for "optimal" control…

    Submitted 6 May, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

    Comments: Submitted to the 58th Hawaii International Conference on System Sciences (HICSS-58)

    Journal ref: HICSS'58 (2025) 3018-3027

  35. arXiv:2503.11124  [pdf, other]

    cs.RO eess.SY physics.flu-dyn

    Flow-Aware Navigation of Magnetic Micro-Robots in Complex Fluids via PINN-Based Prediction

    Authors: Yongyi Jia, Shu Miao, Jiayu Wu, Ming Yang, Chengzhi Hu, Xiang Li

    Abstract: While magnetic micro-robots have demonstrated significant potential across various applications, including drug delivery and microsurgery, the open issue of precise navigation and control in complex fluid environments is crucial for in vivo implementation. This paper introduces a novel flow-aware navigation and control strategy for magnetic micro-robots that explicitly accounts for the impact of f…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: 8

  36. arXiv:2503.08802  [pdf, other]

    eess.IV cs.CV

    Deformable Registration Framework for Augmented Reality-based Surgical Guidance in Head and Neck Tumor Resection

    Authors: Qingyun Yang, Fangjie Li, Jiayi Xu, Zixuan Liu, Sindhura Sridhar, Whitney Jin, Jennifer Du, Jon Heiselman, Michael Miga, Michael Topf, Jie Ying Wu

    Abstract: Head and neck squamous cell carcinoma (HNSCC) has one of the highest rates of recurrence cases among solid malignancies. Recurrence rates can be reduced by improving positive margins localization. Frozen section analysis (FSA) of resected specimens is the gold standard for intraoperative margin assessment. However, because of the complex 3D anatomy and the significant shrinkage of resected specime…

    Submitted 11 March, 2025; originally announced March 2025.

  37. arXiv:2503.03346  [pdf, other]

    eess.SY

    SEAL: Safety Enhanced Trajectory Planning and Control Framework for Quadrotor Flight in Complex Environments

    Authors: Yiming Wang, Jianbin Ma, Junda Wu, Huizhe Li, Zhexuan Zhou, Youmin Gong, Jie Mei, Guangfu Ma

    Abstract: For quadrotors, achieving safe and autonomous flight in complex environments with wind disturbances and dynamic obstacles still faces significant challenges. Most existing methods address wind disturbances in either trajectory planning or control, which may lead to hazardous situations during flight. The emergence of dynamic obstacles would further worsen the situation. Therefore, we propose an ef…

    Submitted 5 March, 2025; originally announced March 2025.

  38. arXiv:2503.02769  [pdf, ps, other]

    cs.SD cs.CL cs.HC eess.AS

    InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training

    Authors: Dingdong Wang, Jin Xu, Ruihang Chu, Zhifang Guo, Xiong Wang, Jincenzi Wu, Dongchao Yang, Shengpeng Ji, Junyang Lin

    Abstract: Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models significantly diminishes when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency b…

    Submitted 4 June, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

    Comments: Accepted to ACL 2025; Data is available at: https://huggingface.co/datasets/ddwang2000/SpeechInstructBench

  39. arXiv:2502.21260  [pdf, other]

    eess.IV

    PET Image Denoising via Text-Guided Diffusion: Integrating Anatomical Priors through Text Prompts

    Authors: Boxiao Yu, Savas Ozdemir, Jiong Wu, Yizhou Chen, Ruogu Fang, Kuangyu Shi, Kuang Gong

    Abstract: Low-dose Positron Emission Tomography (PET) imaging presents a significant challenge due to increased noise and reduced image quality, which can compromise its diagnostic accuracy and clinical utility. Denoising diffusion probabilistic models (DDPMs) have demonstrated promising performance for PET image denoising. However, existing DDPM-based methods typically overlook valuable metadata such as pa…

    Submitted 28 February, 2025; originally announced February 2025.

  40. arXiv:2502.18777  [pdf, other]

    eess.IV

    Hyperspectral image reconstruction by deep learning with super-Rayleigh speckles

    Authors: Ziyan Chen, Zhentao Liu, Jianrong Wu, Shensheng Han

    Abstract: Ghost imaging via sparsity constraints (GISC) spectral camera modulates the three-dimensional (3D) hyperspectral image into a two-dimensional (2D) compressive image with speckles in a single shot. It obtains a 3D hyperspectral image (HSI) by reconstruction algorithms. The rapid development of deep learning has provided a new method for 3D HSI reconstruction. Moreover, the imaging performance of th…

    Submitted 25 February, 2025; originally announced February 2025.

  41. arXiv:2502.11946  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, et al. (120 additional authors not shown)

    Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contribu…

    Submitted 18 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

  42. arXiv:2502.05448  [pdf, other]

    eess.SY math.OC

    Distributionally Robust Model Predictive Control with Mixture of Gaussian Processes

    Authors: Jingyi Wu, Chao Ning

    Abstract: Despite the success of Gaussian process based Model Predictive Control (MPC) in robotic control, its applicability scope is greatly hindered by multimodal disturbances that are prevalent in real-world settings. Here we propose a novel Mixture of Gaussian Processes based Distributionally Robust MPC (MoGP-DR-MPC) framework for linear time invariant systems subject to potentially multimodal state-dep…

    Submitted 7 February, 2025; originally announced February 2025.

    Comments: 6 pages

  43. arXiv:2502.03930  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

    Authors: Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang

    Abstract: Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining…

    Submitted 25 May, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

    Comments: Accepted by ICML 2025

  44. arXiv:2502.00474  [pdf]

    cs.CV cs.LG eess.IV

    A framework for river connectivity classification using temporal image processing and attention based neural networks

    Authors: Timothy James Becker, Derin Gezgin, Jun Yi He Wu, Mary Becker

    Abstract: Measuring the connectivity of water in rivers and streams is essential for effective water resource management. Increased extreme weather events associated with climate change can result in alterations to river and stream connectivity. While traditional stream flow gauges are costly to deploy and limited to large river bodies, trail camera methods are a low-cost and easily deployed alternative to…

    Submitted 1 February, 2025; originally announced February 2025.

    Comments: 15 pages, 8 figures

    ACM Class: I.4.3; I.4.1; I.5.1

  45. arXiv:2501.19058  [pdf, other]

    cs.RO eess.SY

    Gravity Compensation of the dVRK-Si Patient Side Manipulator based on Dynamic Model Identification

    Authors: Haoying Zhou, Hao Yang, Anton Deguet, Loris Fichera, Jie Ying Wu, Peter Kazanzides

    Abstract: The da Vinci Research Kit (dVRK, also known as dVRK Classic) is an open-source teleoperated surgical robotic system whose hardware is obtained from the first generation da Vinci Surgical System (Intuitive, Sunnyvale, CA, USA). The dVRK has greatly facilitated research in robot-assisted surgery over the past decade and helped researchers address multiple major challenges in this domain. Recently, t…

    Submitted 5 February, 2025; v1 submitted 31 January, 2025; originally announced January 2025.

    Journal ref: 2025 Hamlyn Symposium on Medical Robotics

  46. arXiv:2501.16780  [pdf, ps, other]

    cs.SD cs.HC cs.MM eess.AS

    AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals

    Authors: Dongliang Zhou, Yakun Zhang, Jinghan Wu, Xingyu Zhang, Liang Xie, Erwei Yin

    Abstract: The global aging population faces considerable challenges, particularly in communication, due to the prevalence of hearing and speech impairments. To address these, we introduce the AVE speech, a comprehensive multi-modal dataset for speech recognition tasks. The dataset includes a 100-sentence Mandarin corpus with audio signals, lip-region video recordings, and six-channel electromyography (EMG)…

    Submitted 5 July, 2025; v1 submitted 28 January, 2025; originally announced January 2025.

    Comments: The paper has been accepted by IEEE Transactions on Human-Machine Systems

  47. arXiv:2501.15368  [pdf, other]

    cs.CL cs.SD eess.AS

    Baichuan-Omni-1.5 Technical Report

    Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, et al. (68 additional authors not shown)

    Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pip…

    Submitted 25 January, 2025; originally announced January 2025.

  48. arXiv:2501.11079  [pdf, ps, other]

    cs.LG cs.AI eess.SP

    Federated Deep Reinforcement Learning for Energy Efficient Multi-Functional RIS-Assisted Low-Earth Orbit Networks

    Authors: Li-Hsiang Shen, Jyun-Jhe Huang, Kai-Ten Feng, Lie-Liang Yang, Jen-Ming Wu

    Abstract: In this paper, a novel network architecture that deploys the multi-functional reconfigurable intelligent surface (MF-RIS) in low-Earth orbit (LEO) is proposed. Unlike traditional RIS with only signal reflection capability, the MF-RIS can reflect, refract, and amplify signals, as well as harvest energy from wireless signals. Given the high energy demands in shadow regions where solar energy is unav…

    Submitted 19 January, 2025; originally announced January 2025.

  49. arXiv:2501.09972  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions

    Authors: Heda Zuo, Weitao You, Junxian Wu, Shihong Ren, Pei Chen, Mingxu Zhou, Yujia Lu, Lingyun Sun

    Abstract: Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present General Video-to-Music Generation model (GVMGe…

    Submitted 17 January, 2025; originally announced January 2025.

    Comments: Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

  50. arXiv:2501.09838  [pdf, other]

    cs.CV cs.AI eess.IV

    CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation

    Authors: Alex Berian, Daniel Brignac, JhihYang Wu, Natnael Daba, Abhijit Mahalanobis

    Abstract: Geospatial imaging leverages data from diverse sensing modalities, such as EO, SAR, and LiDAR, ranging from ground-level drones to satellite views. These heterogeneous inputs offer significant opportunities for scene understanding but present challenges in interpreting geometry accurately, particularly in the absence of precise ground truth data. To address this, we propose CrossModalityDiffusion,…

    Submitted 16 January, 2025; originally announced January 2025.

    Comments: Accepted in the 2025 WACV workshop GeoCV