Showing 1–50 of 946 results for author: Wang, S

  1. arXiv:2507.06593  [pdf, ps, other]

    cs.CV eess.IV

    Capturing Stable HDR Videos Using a Dual-Camera System

    Authors: Qianyu Zhang, Bolun Zheng, Hangjia Pan, Lingyu Zhu, Zunjie Zhu, Zongpeng Li, Shiqi Wang

    Abstract: In HDR video reconstruction, exposure fluctuations in reference images from alternating exposure methods often result in flickering. To address this issue, we propose a dual-camera system (DCS) for HDR video acquisition, where one camera is assigned to capture consistent reference sequences, while the other is assigned to capture non-reference sequences for information supplementation. To tackle t…

    Submitted 9 July, 2025; originally announced July 2025.

  2. arXiv:2507.06441  [pdf, ps, other]

    eess.SY cs.RO

    VisioPath: Vision-Language Enhanced Model Predictive Control for Safe Autonomous Navigation in Mixed Traffic

    Authors: Shanting Wang, Panagiotis Typaldos, Chenjun Li, Andreas A. Malikopoulos

    Abstract: In this paper, we introduce VisioPath, a novel framework combining vision-language models (VLMs) with model predictive control (MPC) to enable safe autonomous driving in dynamic traffic environments. The proposed approach leverages a bird's-eye view video processing pipeline and zero-shot VLM capabilities to obtain structured information about surrounding vehicles, including their positions, dimen…

    Submitted 8 July, 2025; originally announced July 2025.

  3. arXiv:2507.04598  [pdf, ps, other]

    cs.SD eess.AS

    Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis

    Authors: Sho Inoue, Kun Zhou, Shuai Wang, Haizhou Li

    Abstract: We investigate hierarchical emotion distribution (ED) for achieving multi-level quantitative control of emotion rendering in text-to-speech synthesis (TTS). We introduce a novel multi-step hierarchical ED prediction module that quantifies emotion variance at the utterance, word, and phoneme levels. By predicting emotion variance in a multi-step manner, we leverage global emotional context to refin…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: Accepted to APSIPA Transactions on Signal and Information Processing

  4. arXiv:2507.03887  [pdf, ps, other]

    eess.AS

    Traceable TTS: Toward Watermark-Free TTS with Strong Traceability

    Authors: Yuxiang Zhao, Yunchong Xiao, Yushen Chen, Zhikang Niu, Shuai Wang, Kai Yu, Xie Chen

    Abstract: Recent advances in Text-To-Speech (TTS) technology have enabled synthetic speech to mimic human voices with remarkable realism, raising significant security concerns. This underscores the need for traceable TTS models: systems capable of tracing their synthesized speech without compromising quality or security. However, existing methods predominantly rely on explicit watermarking on speech or on vo…

    Submitted 4 July, 2025; originally announced July 2025.

  5. arXiv:2507.03341  [pdf, ps, other]

    eess.IV cs.CV physics.med-ph

    UltraDfeGAN: Detail-Enhancing Generative Adversarial Networks for High-Fidelity Functional Ultrasound Synthesis

    Authors: Zhuo Li, Xuhang Chen, Shuqiang Wang

    Abstract: Functional ultrasound (fUS) is a neuroimaging technique known for its high spatiotemporal resolution, enabling non-invasive observation of brain activity through neurovascular coupling. Despite its potential in clinical applications such as neonatal monitoring and intraoperative guidance, the development of fUS faces challenges related to data scarcity and limitations in generating realistic fUS i…

    Submitted 4 July, 2025; originally announced July 2025.

  6. arXiv:2507.03043  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

    Authors: Shuhe Li, Chenxu Guo, Jiachen Lian, Cheol Jun Cho, Wenshuo Zhao, Xuanru Zhou, Dingkun Zhou, Sam Wang, Grace Wang, Jingze Yang, Jingyi Xu, Ruohan Bao, Elise Brenner, Brandon In, Francesca Pei, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

    Abstract: Early evaluation of children's language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specifi…

    Submitted 3 July, 2025; originally announced July 2025.

  7. arXiv:2507.02289  [pdf, ps, other]

    eess.IV cs.CV

    CineMyoPS: Segmenting Myocardial Pathologies from Cine Cardiac MR

    Authors: Wangbin Ding, Lei Li, Junyi Qiu, Bogen Lin, Mingjing Yang, Liqin Huang, Lianming Wu, Sihan Wang, Xiahai Zhuang

    Abstract: Myocardial infarction (MI) is a leading cause of death worldwide. Late gadolinium enhancement (LGE) and T2-weighted cardiac magnetic resonance (CMR) imaging can respectively identify scarring and edema areas, both of which are essential for MI risk stratification and prognosis assessment. Although combining complementary information from multi-sequence CMR is useful, acquiring these sequences can…

    Submitted 2 July, 2025; originally announced July 2025.

  8. arXiv:2507.01055  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    Prompt Mechanisms in Medical Imaging: A Comprehensive Survey

    Authors: Hao Yang, Xinlong Liang, Zhang Li, Yue Sun, Zheyu Hu, Xinghe Xie, Behdad Dashtbozorg, Jincheng Huang, Shiwei Zhu, Luyi Han, Jiong Zhang, Shanshan Wang, Ritse Mann, Qifeng Yu, Tao Tan

    Abstract: Deep learning offers transformative potential in medical imaging, yet its clinical adoption is frequently hampered by challenges such as data scarcity, distribution shifts, and the need for robust task generalization. Prompt-based methodologies have emerged as a pivotal strategy to guide deep learning models, providing flexible, domain-specific adaptations that significantly enhance model performa…

    Submitted 27 June, 2025; originally announced July 2025.

  9. arXiv:2506.22810  [pdf, ps, other]

    cs.SD eess.AS

    A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition

    Authors: Shiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin

    Abstract: Dysarthric speech recognition (DSR) enhances the accessibility of smart devices for dysarthric speakers with limited mobility. Previously, DSR research was constrained by the fact that existing datasets typically consisted of isolated words, command phrases, and a limited number of sentences spoken by a few individuals. This constrained research to command-interaction systems and speaker adaptatio…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: accepted by Interspeech 2025

    Journal ref: INTERSPEECH 2025

  10. arXiv:2506.21269  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Integrating Vehicle Acoustic Data for Enhanced Urban Traffic Management: A Study on Speed Classification in Suzhou

    Authors: Pengfei Fan, Yuli Zhang, Xinheng Wang, Ruiyuan Jiang, Hankang Gu, Dongyao Jia, Shangbo Wang

    Abstract: This study presents and publicly releases the Suzhou Urban Road Acoustic Dataset (SZUR-Acoustic Dataset), which is accompanied by comprehensive data-acquisition protocols and annotation guidelines to ensure transparency and reproducibility of the experimental workflow. To model the coupling between vehicular noise and driving speed, we propose a bimodal-feature-fusion deep convolutional neural net…

    Submitted 26 June, 2025; originally announced June 2025.

  11. arXiv:2506.20513  [pdf, ps, other]

    physics.geo-ph cs.LG eess.SP

    Fast ground penetrating radar dual-parameter full waveform inversion method accelerated by hybrid compilation of CUDA kernel function and PyTorch

    Authors: Lei Liu, Chao Song, Liangsheng He, Silin Wang, Xuan Feng, Cai Liu

    Abstract: This study proposes a high-performance dual-parameter full waveform inversion framework (FWI) for ground-penetrating radar (GPR), accelerated through the hybrid compilation of CUDA kernel functions and PyTorch. The method leverages the computational efficiency of GPU programming while preserving the flexibility and usability of Python-based deep learning frameworks. By integrating customized CUDA…

    Submitted 25 June, 2025; originally announced June 2025.

  12. arXiv:2506.19774  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.SD

    Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

    Authors: Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai

    Abstract: We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine it with a visual semantic representation module and an audio-visual synchronization module to enhance alig…

    Submitted 24 June, 2025; originally announced June 2025.

  13. arXiv:2506.19266  [pdf]

    q-bio.NC cs.CV eess.IV

    Convergent and divergent connectivity patterns of the arcuate fasciculus in macaques and humans

    Authors: Jiahao Huang, Ruifeng Li, Wenwen Yu, Anan Li, Xiangning Li, Mingchao Yan, Lei Xie, Qingrun Zeng, Xueyan Jia, Shuxin Wang, Ronghui Ju, Feng Chen, Qingming Luo, Hui Gong, Andrew Zalesky, Xiaoquan Yang, Yuanjing Feng, Zheng Wang

    Abstract: The organization and connectivity of the arcuate fasciculus (AF) in nonhuman primates remain contentious, especially concerning how its anatomy diverges from that of humans. Here, we combined cross-scale single-neuron tracing - using viral-based genetic labeling and fluorescence micro-optical sectioning tomography in macaques (n = 4; age 3 - 11 years) - with whole-brain tractography from 11.7T dif…

    Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: 34 pages, 6 figures

  14. arXiv:2506.18172  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    STACT-Time: Spatio-Temporal Cross Attention for Cine Thyroid Ultrasound Time Series Classification

    Authors: Irsyad Adam, Tengyue Zhang, Shrayes Raman, Zhuyu Qiu, Brandon Taraku, Hexiang Feng, Sile Wang, Ashwath Radhachandran, Shreeram Athreya, Vedrana Ivezic, Peipei Ping, Corey Arnold, William Speier

    Abstract: Thyroid cancer is among the most common cancers in the United States. Thyroid nodules are frequently detected through ultrasound (US) imaging, and some require further evaluation via fine-needle aspiration (FNA) biopsy. Despite its effectiveness, FNA often leads to unnecessary biopsies of benign nodules, causing patient discomfort and anxiety. To address this, the American College of Radiology Thy…

    Submitted 22 June, 2025; originally announced June 2025.

  15. arXiv:2506.17924  [pdf, ps, other]

    math.OC eess.SY

    Inverse Chance Constrained Optimal Power Flow

    Authors: Shenglu Wang, Kairui Feng, Mengqi Xue, Yue Song

    Abstract: The chance constrained optimal power flow (CC-OPF) essentially finds the low-cost generation dispatch scheme ensuring operational constraints are met with a specified probability, termed the security level. While the security level is a crucial input parameter, how it shapes the CC-OPF feasibility boundary has not been revealed. Changing the security level from a parameter to a decision variable,…

    Submitted 22 June, 2025; originally announced June 2025.

    Comments: 3 pages, 1 figure

  16. arXiv:2506.15125  [pdf, ps, other]

    eess.SP

    Fiber Signal Denoising Algorithm using Hybrid Deep Learning Networks

    Authors: Linlin Wang, Wei Wang, Dezhao Wang, Shanwen Wang

    Abstract: With the applicability of optical fiber-based distributed acoustic sensing (DAS) systems, effective signal processing and analysis approaches are needed to promote its popularization in the field of intelligent transportation systems (ITS). This paper presents a signal denoising algorithm using a hybrid deep-learning network (HDLNet). Without annotated data and time-consuming labeling, this self-s…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: 15 pages, 10 figures

  17. arXiv:2506.13709  [pdf, ps, other]

    eess.AS cs.SD

    SpeechRefiner: Towards Perceptual Quality Refinement for Front-End Algorithms

    Authors: Sirui Li, Shuai Wang, Zhijun Liu, Zhongjie Jiang, Yannan Wang, Haizhou Li

    Abstract: Speech pre-processing techniques such as denoising, de-reverberation, and separation, are commonly employed as front-ends for various downstream speech processing tasks. However, these methods can sometimes be inadequate, resulting in residual noise or the introduction of new artifacts. Such deficiencies are typically not captured by metrics like SI-SNR but are noticeable to human listeners. To ad…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted by Interspeech 2025

  18. arXiv:2506.11823  [pdf, ps, other]

    eess.IV cs.CV

    Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution

    Authors: Zhangkai Ni, Yang Zhang, Wenhan Yang, Hanli Wang, Shiqi Wang, Sam Kwong

    Abstract: Major efforts in data-driven image super-resolution (SR) primarily focus on expanding the receptive field of the model to better capture contextual information. However, these methods are typically implemented by stacking deeper networks or leveraging transformer-based attention mechanisms, which consequently increases model complexity. In contrast, model-driven methods based on the unfolding para…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: Accepted to IEEE Transactions on Image Processing

  19. arXiv:2506.11671  [pdf, ps, other]

    eess.IV cs.CV

    Brain Network Analysis Based on Fine-tuned Self-supervised Model for Brain Disease Diagnosis

    Authors: Yifei Tang, Hongjie Jiang, Changhong Jing, Hieu Pham, Shuqiang Wang

    Abstract: Functional brain network analysis has become an indispensable tool for brain disease analysis. It is profoundly impacted by deep learning methods, which can characterize complex connections between ROIs. However, the research on foundation models of brain network is limited and constrained to a single dimension, which restricts their extensive application in neuroscience. In this study, we propose…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 13 pages, 3 figures, International Conference on Neural Computing for Advanced Applications

  20. arXiv:2506.11150  [pdf, ps, other]

    eess.IV cs.CV

    ADAgent: LLM Agent for Alzheimer's Disease Analysis with Collaborative Coordinator

    Authors: Wenlong Hou, Guangqian Yang, Ye Du, Yeung Lau, Lihao Liu, Junjun He, Ling Long, Shujun Wang

    Abstract: Alzheimer's disease (AD) is a progressive and irreversible neurodegenerative disease. Early and precise diagnosis of AD is crucial for timely intervention and treatment planning to alleviate the progressive neurodegeneration. However, most existing methods rely on single-modality data, which contrasts with the multifaceted approach used by medical experts. While some deep learning approaches proce…

    Submitted 15 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  21. Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective

    Authors: Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai, Shizheng Wang, Yang Long, Yefeng Zheng

    Abstract: Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient considerati…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted by IEEE Transactions on Medical Imaging

  22. arXiv:2506.09792  [pdf, ps, other]

    cs.SD cs.LG cs.MM eess.AS

    Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction

    Authors: Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li

    Abstract: Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others. We know that humans leverage linguistic knowledge, such as syntax and semantics, to support speech perception. Inspired by this, we explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowl…

    Submitted 15 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted by Interspeech 2025

  23. arXiv:2506.09273  [pdf, ps, other]

    eess.SY cs.DM math.OC nlin.AO

    Data-Driven Nonlinear Regulation: Gaussian Process Learning

    Authors: Telema Harry, Martin Guay, Shimin Wang, Richard D. Braatz

    Abstract: This article addresses the output regulation problem for a class of nonlinear systems using a data-driven approach. An output feedback controller is proposed that integrates a traditional control component with a data-driven learning algorithm based on Gaussian Process (GP) regression to learn the nonlinear internal model. Specifically, a data-driven technique is employed to directly approximate t…

    Submitted 10 June, 2025; originally announced June 2025.

  24. arXiv:2506.07709  [pdf, ps, other]

    eess.IV cs.CV

    Fine-Grained Motion Compression and Selective Temporal Fusion for Neural B-Frame Video Coding

    Authors: Xihua Sheng, Peilin Chen, Meng Wang, Li Zhang, Shiqi Wang, Dapeng Oliver Wu

    Abstract: With the remarkable progress in neural P-frame video coding, neural B-frame coding has recently emerged as a critical research direction. However, most existing neural B-frame codecs directly adopt P-frame coding tools without adequately addressing the unique challenges of B-frame compression, leading to suboptimal performance. To bridge this gap, we propose novel enhancements for motion compressi…

    Submitted 9 June, 2025; originally announced June 2025.

  25. arXiv:2506.07634  [pdf, ps, other]

    eess.AS cs.MM

    SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

    Authors: Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li

    Abstract: Generating music with coherent structure, harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$,…

    Submitted 23 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Submitted to NeurIPS2025

  26. arXiv:2506.07520  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    LeVo: High-Quality Song Generation with Multi-Preference Alignment

    Authors: Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Jianwei Yu, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu

    Abstract: Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address…

    Submitted 15 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  27. arXiv:2506.06038  [pdf, ps, other]

    eess.SY cs.RO

    Trajectory Optimization for UAV-Based Medical Delivery with Temporal Logic Constraints and Convex Feasible Set Collision Avoidance

    Authors: Kaiyuan Chen, Yuhan Suo, Shaowei Cui, Yuanqing Xia, Wannian Liang, Shuo Wang

    Abstract: This paper addresses the problem of trajectory optimization for unmanned aerial vehicles (UAVs) performing time-sensitive medical deliveries in urban environments. Specifically, we consider a single UAV with 3 degree-of-freedom dynamics tasked with delivering blood packages to multiple hospitals, each with a predefined time window and priority. Mission objectives are encoded using Signal Temporal…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: 7 pages, 4 figures

  28. arXiv:2506.06012  [pdf, ps, other]

    cs.RO eess.SY math.OC

    Enhanced Trust Region Sequential Convex Optimization for Multi-Drone Thermal Screening Trajectory Planning in Urban Environments

    Authors: Kaiyuan Chen, Zhengjie Hu, Shaolin Zhang, Yuanqing Xia, Wannian Liang, Shuo Wang

    Abstract: The rapid detection of abnormal body temperatures in urban populations is essential for managing public health risks, especially during outbreaks of infectious diseases. Multi-drone thermal screening systems offer promising solutions for fast, large-scale, and non-intrusive human temperature monitoring. However, trajectory planning for multiple drones in complex urban environments poses significan…

    Submitted 19 June, 2025; v1 submitted 6 June, 2025; originally announced June 2025.

  29. arXiv:2506.04116  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    A Diffusion-Driven Temporal Super-Resolution and Spatial Consistency Enhancement Framework for 4D MRI imaging

    Authors: Xuanru Zhou, Jiarun Liu, Shoujun Yu, Hao Yang, Cheng Li, Tao Tan, Shanshan Wang

    Abstract: In medical imaging, 4D MRI enables dynamic 3D visualization, yet the trade-off between spatial and temporal resolution requires prolonged scan time that can compromise temporal fidelity--especially during rapid, large-amplitude motion. Traditional approaches typically rely on registration-based interpolation to generate intermediate frames. However, these methods struggle with large deformations,…

    Submitted 8 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

  30. arXiv:2506.04028  [pdf]

    eess.SY

    An Improved Finite Element Modeling Method for Triply Periodic Minimal Surface Structures Based on Element Size and Minimum Jacobian

    Authors: Siqi Wang, Chuangyu Jiang, Xiaodong Zhang, Yilong Zhang, Baoqiang Zhang, Huageng Luo

    Abstract: Triply periodic minimal surface (TPMS) structures, a type of lattice structure, have garnered significant attention due to their lightweight nature, controllability, and excellent mechanical properties. Voxel-based modeling is a widely used method for investigating the mechanical behavior of such lattice structures through finite element simulations. This study proposes a two-parameter voxel metho…

    Submitted 4 June, 2025; originally announced June 2025.

  31. arXiv:2506.02039  [pdf, other]

    eess.AS cs.AI cs.SD

    No Audiogram: Leveraging Existing Scores for Personalized Speech Intelligibility Prediction

    Authors: Haoshuai Zhou, Changgeng Mo, Boxuan Cao, Linkai Li, Shan Xiang Wang

    Abstract: Personalized speech intelligibility prediction is challenging. Previous approaches have mainly relied on audiograms, which are inherently limited in accuracy as they only capture a listener's hearing threshold for pure tones. Rather than incorporating additional listener features, we propose a novel approach that leverages an individual's existing intelligibility data to predict their performance…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted at Interspeech 2025

  32. arXiv:2506.00045  [pdf, other]

    cs.SD eess.AS

    ACE-Step: A Step Towards Music Generation Foundation Model

    Authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo

    Abstract: We introduce ACE-Step, a novel open-source foundation model for music generation that overcomes key limitations of existing approaches and achieves state-of-the-art performance through a holistic architectural design. Current methods face inherent trade-offs between generation speed, musical coherence, and controllability. For example, LLM-based models (e.g. Yue, SongGen) excel at lyric alignment…

    Submitted 28 May, 2025; originally announced June 2025.

    Comments: 14 pages, 5 figures, ace-step's tech report

  33. arXiv:2505.24804  [pdf, ps, other]

    eess.SP cs.IT

    Coordinated Beamforming for RIS-Empowered ISAC Systems over Secure Low-Altitude Networks

    Authors: Chunjie Wang, Xuhui Zhang, Wenchao Liu, Jinke Ren, Huijun Xing, Shuqiang Wang, Yanyan Shen

    Abstract: Emerging as a cornerstone for next-generation wireless networks, integrated sensing and communication (ISAC) systems demand innovative solutions to balance spectral efficiency and sensing accuracy. In this paper, we propose a coordinated beamforming framework for a reconfigurable intelligent surface (RIS)-empowered ISAC system, where the active precoding at the dual-functional base station (DFBS)…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: This manuscript has been submitted to the IEEE

  34. arXiv:2505.23821  [pdf, ps, other]

    cs.CR cs.SD eess.AS

    SpeechVerifier: Robust Acoustic Fingerprint against Tampering Attacks via Watermarking

    Authors: Lingfeng Yao, Chenpei Huang, Shengyao Wang, Junpei Xue, Hanqing Guo, Jiang Liu, Xun Chen, Miao Pan

    Abstract: With the surge of social media, maliciously tampered public speeches, especially those from influential figures, have seriously affected social stability and public trust. Existing speech tampering detection methods remain insufficient: they either rely on external reference data or fail to be both sensitive to attacks and robust to benign operations, such as compression and resampling. To tackle…

    Submitted 1 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

  35. arXiv:2505.23036  [pdf, ps, other]

    cs.SD eess.AS

    AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition

    Authors: Yuhang Dai, He Wang, Xingchen Li, Zihan Zhang, Shuiyuan Wang, Lei Xie, Xin Xu, Hongxiao Guo, Shaoji Zhang, Hui Bu, Wei Chen

    Abstract: This paper delineates AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHELL-5 includes two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios. This audio data consists of four far-field speech signals captured by microphones located on each car do…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 5 pages, 1 figure, 3 tables, accepted by Interspeech 2025

  36. arXiv:2505.22266  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    FGAS: Fixed Decoder Network-Based Audio Steganography with Adversarial Perturbation Generation

    Authors: Jialin Yan, Yu Cheng, Zhaoxia Yin, Xinpeng Zhang, Shilin Wang, Tanfeng Sun, Xinghao Jiang

    Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) has made high-fidelity generated audio widely available across the Internet, providing diverse cover signals for covert communication. Driven by advances in deep learning, current audio steganography schemes are mainly based on encoding-decoding network architectures. While these methods greatly improve the security of audio…

    Submitted 5 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

  37. arXiv:2505.21838  [pdf, ps, other]

    eess.SY cs.AI math.OC nlin.CD

    Nonadaptive Output Regulation of Second-Order Nonlinear Uncertain Systems

    Authors: Maobin Lu, Martin Guay, Telema Harry, Shimin Wang, Jordan Cooper

    Abstract: This paper investigates the robust output regulation problem of second-order nonlinear uncertain systems with an unknown exosystem. Instead of the adaptive control approach, this paper resorts to a robust control methodology to solve the problem and thus avoid the bursting phenomenon. In particular, this paper constructs generic internal models for the steady-state state and input variables of the…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 8 pages, 3 figures

  38. arXiv:2505.21818  [pdf, other]

    eess.SY

    Learning-Based Tracking Perimeter Control for Two-region Macroscopic Traffic Dynamics

    Authors: Can Chen, Yunping Huang, Hongwei Zhang, Shimin Wang, Martin Guay, Shu-Chien Hsu, Renxin Zhong

    Abstract: Leveraging the concept of the macroscopic fundamental diagram (MFD), perimeter control can alleviate network-level congestion by identifying critical intersections and regulating them effectively. Considering the time-varying nature of travel demand and the equilibrium of accumulation state, we extend the conventional set-point perimeter control (SPC) problem for the two-region MFD system as an op…

    Submitted 27 May, 2025; originally announced May 2025.

  39. arXiv:2505.21530  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    High-Fidelity Functional Ultrasound Reconstruction via A Visual Auto-Regressive Framework

    Authors: Xuhang Chen, Zhuo Li, Yanyan Shen, Mufti Mahmud, Hieu Pham, Chi-Man Pun, Shuqiang Wang

    Abstract: Functional ultrasound (fUS) imaging provides exceptional spatiotemporal resolution for neurovascular mapping, yet its practical application is significantly hampered by critical challenges. Foremost among these are data scarcity, arising from ethical considerations and signal degradation through the cranium, which collectively limit dataset diversity and compromise the fairness of downstream machi…

    Submitted 23 May, 2025; originally announced May 2025.

  40. arXiv:2505.20678  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts

    Authors: Tianhua Qi, Shiyan Wang, Cheng Lu, Tengfei Song, Hao Yang, Zhanglin Wu, Wenming Zheng

    Abstract: Controllable emotional voice conversion (EVC) aims to manipulate emotional expressions to increase the diversity of synthesized speech. Existing methods typically rely on predefined labels, reference audios, or prespecified factor values, often overlooking individual differences in emotion perception and expression. In this paper, we introduce PromptEVC that utilizes natural language prompts for p…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted to INTERSPEECH2025

  41. arXiv:2505.19947  [pdf, ps, other]

    cs.LG cs.AI eess.SY

    Dynamically Learned Test-Time Model Routing in Language Model Zoos with Service Level Guarantees

    Authors: Herbert Woisetschläger, Ryan Zhang, Shiqiang Wang, Hans-Arno Jacobsen

    Abstract: Open-weight LLM zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing int…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Preprint. Under review

    ACM Class: I.2; I.2.7; I.2.8

  42. arXiv:2505.18484  [pdf, ps, other]

    cs.SD eess.AS

    Token-Level Logits Matter: A Closer Look at Speech Foundation Models for Ambiguous Emotion Recognition

    Authors: Jule Valendo Halim, Siyi Wang, Hong Jia, Ting Dang

    Abstract: Emotional intelligence in conversational AI is crucial across domains like human-computer interaction. While numerous models have been developed, they often overlook the complexity and ambiguity inherent in human emotions. In the era of large speech foundation models (SFMs), understanding their capability in recognizing ambiguous emotions is essential for the development of next-generation emotion…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Accepted at INTERSPEECH 2025

  43. arXiv:2505.16152  [pdf, other]

    eess.IV cs.CV

    Compressing Human Body Video with Interactive Semantics: A Generative Approach

    Authors: Bolin Chen, Shanzhi Yin, Hanwei Zhu, Lingyu Zhu, Zihan Zhang, Jie Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye

    Abstract: In this paper, we propose to compress human body video with interactive semantics, which can facilitate video coding to be interactive and controllable by manipulating semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle nonlinear dynamics and complex motion of human body signal into a series of configurable emb…

    Submitted 21 May, 2025; originally announced May 2025.

  44. arXiv:2505.14753  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    TransMedSeg: A Transferable Semantic Framework for Semi-Supervised Medical Image Segmentation

    Authors: Mengzhu Wang, Jiao Li, Shanshan Wang, Long Lan, Huibin Tan, Liang Yang, Guoli Yang

    Abstract: Semi-supervised learning (SSL) has achieved significant progress in medical image segmentation (SSMIS) through effective utilization of limited labeled data. While current SSL methods for medical images predominantly rely on consistency regularization and pseudo-labeling, they often overlook transferable semantic relationships across different clinical domains and imaging modalities. To address th…

    Submitted 20 May, 2025; originally announced May 2025.

  45. arXiv:2505.14356  [pdf, other]

    cs.SD cs.CL eess.AS

    PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs

    Authors: Sho Inoue, Shuai Wang, Haizhou Li

    Abstract: Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: This is accepted to Interspeech 2025; Added an extra page for supplementary figures; Project page: https://github.com/shinshoji01/Personality-Prediction-for-Conversation-Agents

  46. arXiv:2505.09985  [pdf]

    eess.IV cs.CV

    Ordered-subsets Multi-diffusion Model for Sparse-view CT Reconstruction

    Authors: Pengfei Yu, Bin Huang, Minghui Zhang, Weiwen Wu, Shaoyu Wang, Qiegen Liu

    Abstract: Score-based diffusion models have shown significant promise in the field of sparse-view CT reconstruction. However, the projection dataset is large and riddled with redundancy. Consequently, applying the diffusion model to unprocessed data results in lower learning effectiveness and higher learning difficulty, frequently leading to reconstructed images that lack fine details. To address these issu…

    Submitted 15 May, 2025; originally announced May 2025.

  47. arXiv:2505.08215  [pdf, ps, other]

    cs.AI cs.SD eess.AS

    Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People

    Authors: Haoshuai Zhou, Boxuan Cao, Changgeng Mo, Linkai Li, Shan Xiang Wang

    Abstract: Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder…

    Submitted 13 May, 2025; originally announced May 2025.

  48. arXiv:2505.05768  [pdf, other]

    eess.IV cs.AI cs.CV

    Predicting Diabetic Macular Edema Treatment Responses Using OCT: Dataset and Methods of APTOS Competition

    Authors: Weiyi Zhang, Peranut Chotcomwongse, Yinwen Li, Pusheng Xu, Ruijie Yao, Lianhao Zhou, Yuxuan Zhou, Hui Feng, Qiping Zhou, Xinyue Wang, Shoujin Huang, Zihao Jin, Florence H. T. Chung, Shujun Wang, Yalin Zheng, Mingguang He, Danli Shi, Paisan Ruamviboonsuk

    Abstract: Diabetic macular edema (DME) significantly contributes to visual impairment in diabetic patients. Treatment responses to intravitreal therapies vary, highlighting the need for patient stratification to predict therapeutic benefits and enable personalized strategies. To our knowledge, this study is the first to explore pre-treatment stratification for predicting DME treatment responses. To advance…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: 42 pages, 5 tables, 12 figures, challenge report

  49. arXiv:2505.04982  [pdf, other]

    cs.RO eess.SY

    A Vehicle System for Navigating Among Vulnerable Road Users Including Remote Operation

    Authors: Oscar de Groot, Alberto Bertipaglia, Hidde Boekema, Vishrut Jain, Marcell Kegl, Varun Kotian, Ted Lentsch, Yancong Lin, Chrysovalanto Messiou, Emma Schippers, Farzam Tajdari, Shiming Wang, Zimin Xia, Mubariz Zaffar, Ronald Ensing, Mario Garzon, Javier Alonso-Mora, Holger Caesar, Laura Ferranti, Riender Happee, Julian F. P. Kooij, Georgios Papaioannou, Barys Shyrokau, Dariu M. Gavrila

    Abstract: We present a vehicle system capable of navigating safely and efficiently around Vulnerable Road Users (VRUs), such as pedestrians and cyclists. The system comprises key modules for environment perception, localization and mapping, motion planning, and control, integrated into a prototype vehicle. A key innovation is a motion planner based on Topology-driven Model Predictive Control (T-MPC). The gu…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Intelligent Vehicles Symposium 2025

  50. arXiv:2504.19401  [pdf]

    physics.med-ph cs.CV cs.GR eess.IV

    Innovative Integration of 4D Cardiovascular Reconstruction and Hologram: A New Visualization Tool for Coronary Artery Bypass Grafting Planning

    Authors: Shuo Wang, Tong Ren, Nan Cheng, Li Zhang, Rong Wang

    Abstract: Background: Coronary artery bypass grafting (CABG) planning requires advanced spatial visualization and consideration of coronary artery depth, calcification, and pericardial adhesions. Objective: To develop and evaluate a dynamic cardiovascular holographic visualization tool for preoperative CABG planning. Methods: Using 4D cardiac computed tomography angiography data from 14 CABG candidates, we…

    Submitted 27 April, 2025; originally announced April 2025.

    Comments: 35 pages, 9 figures

    ACM Class: J.3; I.3.8