
Showing 1–50 of 1,441 results for author: Chen, Y

Searching in archive eess.
  1. arXiv:2507.06116

    cs.SD cs.AI eess.AS

    Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis

    Authors: Xintong Hu, Yixuan Chen, Rui Yang, Wenxiang Guo, Changhao Pan

    Abstract: Automatic speech quality assessment plays a crucial role in the development of speech synthesis systems, but existing models exhibit significant performance variations across different granularity levels of prediction tasks. This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models, incorporating a Mixture of Experts (MoE) classification head and utilizi…

    Submitted 8 July, 2025; originally announced July 2025.

  2. arXiv:2507.04622

    eess.IV cs.CV

    A Deep Unfolding Framework for Diffractive Snapshot Spectral Imaging

    Authors: Zhengyue Zhuge, Jiahui Xu, Shiqi Chen, Hao Xu, Yueting Chen, Zhihai Xu, Huajun Feng

    Abstract: Snapshot hyperspectral imaging systems acquire spectral data cubes through compressed sensing. Recently, diffractive snapshot spectral imaging (DSSI) methods have attracted significant attention. While various optical designs and improvements continue to emerge, research on reconstruction algorithms remains limited. Although numerous networks and deep unfolding methods have been applied to similar…

    Submitted 6 July, 2025; originally announced July 2025.

  3. arXiv:2507.03887

    eess.AS

    Traceable TTS: Toward Watermark-Free TTS with Strong Traceability

    Authors: Yuxiang Zhao, Yunchong Xiao, Yushen Chen, Zhikang Niu, Shuai Wang, Kai Yu, Xie Chen

    Abstract: Recent advances in Text-To-Speech (TTS) technology have enabled synthetic speech to mimic human voices with remarkable realism, raising significant security concerns. This underscores the need for traceable TTS models: systems capable of tracing their synthesized speech without compromising quality or security. However, existing methods predominantly rely on explicit watermarking on speech or on vo…

    Submitted 4 July, 2025; originally announced July 2025.

  4. arXiv:2507.03640

    physics.optics eess.IV

    Subpixel correction of diffraction pattern shifts in ptychography via automatic differentiation

    Authors: Zhengkang Xu, Yanqi Chen, Hao Xu, Qingxin Wang, Jin Niu, Lei Huang, Jiyue Tang, Yongjun Ma, Yutong Wang, Yishi Shi, Changjun Ke, Jie Li, Zhongwei Fan

    Abstract: Ptychography, a coherent diffraction imaging technique, has become an indispensable tool in materials characterization, biological imaging, and nanostructure analysis due to its capability for high-resolution, lensless reconstruction of complex-valued images. In typical workflows, raw diffraction patterns are commonly cropped to isolate the valid central region before reconstruction. However, if t…

    Submitted 4 July, 2025; originally announced July 2025.

  5. arXiv:2507.02437

    cs.CV eess.IV

    F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning

    Authors: Wei Li, Jingyang Zhang, Lihao Liu, Guoan Wang, Junjun He, Yang Chen, Lixu Gu

    Abstract: Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in ran…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: This paper has been submitted to relevant journals

  6. arXiv:2507.01608

    cs.CV eess.IV

    Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference

    Authors: Xu Zhang, Ming Lu, Yan Chen, Zhan Ma

    Abstract: In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire v…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: International Conference on Multimedia and Expo (ICME), 2025

  7. arXiv:2507.01360

    cs.NI eess.SP

    MmBack: Clock-free Multi-Sensor Backscatter with Synchronous Acquisition and Multiplexing

    Authors: Yijie Li, Weichong Ling, Taiting Lu, Yi-Chao Chen, Vaishnavi Ranganathan, Lili Qiu, Jingxian Wang

    Abstract: Backscatter tags provide a low-power solution for sensor applications, yet many real-world scenarios require multiple sensors, often of different types, for complex sensing tasks. However, existing designs support only a single sensor per tag, increasing spatial overhead. State-of-the-art approaches to multiplexing multiple sensor streams on a single tag rely on onboard clocks or multiple modulation…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 16 pages, 14 figures

  8. arXiv:2507.00209

    eess.IV cs.AI cs.CV cs.RO

    SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures

    Authors: Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Ruixing Liang, Yuxin Chen, Adi Chola Venkatesh, Jason Culman, Tiantian Wu, Lirong Shao, Wenqing Sun, Cong Gao, Hallie McNamara, Jingpei Lu, Omid Mohareri

    Abstract: High-resolution imaging is crucial for enhancing visual clarity and enabling precise computer-assisted guidance in minimally invasive surgery (MIS). Despite the increasing adoption of 4K endoscopic systems, there remains a significant gap in publicly available native 4K datasets tailored specifically for robotic-assisted MIS. We introduce SurgiSR4K, the first publicly accessible surgical imaging a…

    Submitted 7 July, 2025; v1 submitted 30 June, 2025; originally announced July 2025.

  9. arXiv:2506.24014

    eess.IV

    Simultaneous Super-Resolution of Spatial and Spectral Imaging with a Camera Array and Notch Filters

    Authors: Peng Lin, Xuesong Wang, Yating Chen, Xianyu Wu, Feng Huang, Shouqian Chen

    Abstract: This study proposes an algorithm based on a notch filter camera array system for simultaneous super-resolution imaging and spectral reconstruction, enhancing the spatial resolution and multispectral imaging capabilities of targets. In this study, multi-aperture super-resolution algorithms, pan-sharpening techniques, and spectral reconstruction algorithms were investigated and integrated. The sub-p…

    Submitted 30 June, 2025; originally announced June 2025.

  10. arXiv:2506.23484

    cs.MM cs.CV eess.IV

    TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

    Authors: Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu

    Abstract: AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to the above issues. Among these, watermarking methods capable of preserving the generation quality are receiving increased attention. However, the proliferation and…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025 (2025 IEEE/CVF International Conference on Computer Vision)

    ACM Class: I.3.3; I.4.9

  11. arXiv:2506.23472

    eess.SP

    Automatic Phase Calibration for High-resolution mmWave Sensing via Ambient Radio Anchors

    Authors: Ruixu Geng, Yadong Li, Dongheng Zhang, Pengcheng Huang, Binquan Wang, Binbin Zhang, Zhi Lu, Yang Hu, Yan Chen

    Abstract: Millimeter-wave (mmWave) radar systems with large arrays have pushed radar sensing into a new era, thanks to their high angular resolution. However, our long-term experiments indicate that array elements exhibit phase drift over time and require periodic phase calibration to maintain high resolution, creating an obstacle for practical high-resolution mmWave sensing. Unfortunately, existing calibrat…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: 13 pages, 21 figures

  12. arXiv:2506.22972

    eess.AS

    Adaptable Non-parametric Approach for Speech-based Symptom Assessment: Isolating Private Medical Data in a Retrieval Datastore

    Authors: Yu-Wen Chen, Julia Hirschberg

    Abstract: The automatic assessment of health-related acoustic cues has the potential to improve healthcare accessibility and affordability. Although parametric models are promising, they face challenges in privacy and adaptability. To address these, we propose a NoN-Parametric framework for Speech-based symptom Assessment (NoNPSA). By isolating medical data in a retrieval datastore, NoNPSA avoids encoding p…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: IEEE MLSP 2025

  13. arXiv:2506.22882

    eess.IV cs.CV cs.LG

    CA-Diff: Collaborative Anatomy Diffusion for Brain Tissue Segmentation

    Authors: Qilong Xing, Zikai Song, Yuteng Ye, Yuke Chen, Youjia Zhang, Na Feng, Junqing Yu, Wei Yang

    Abstract: Segmentation of brain structures from MRI is crucial for evaluating brain morphology, yet existing CNN and transformer-based methods struggle to delineate complex structures accurately. While current diffusion models have shown promise in image segmentation, they are inadequate when applied directly to brain MRI due to neglecting anatomical information. To address this, we propose Collaborative An…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: ICME 2025

  14. arXiv:2506.22790

    eess.IV cs.CV cs.MM

    ICME 2025 Generalizable HDR and SDR Video Quality Measurement Grand Challenge

    Authors: Yixu Chen, Bowen Chen, Hai Wei, Alan C. Bovik, Baojun Li, Wei Sun, Linhan Cao, Kang Fu, Dandan Zhu, Jun Jia, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Dounia Hammou, Fei Yin, Rafal Mantiuk, Amritha Premkumar, Prajit T Rajendran, Vignesh V Menon

    Abstract: This paper reports on the IEEE International Conference on Multimedia & Expo (ICME) 2025 Grand Challenge on Generalizable HDR and SDR Video Quality Measurement. With the rapid development of video technology, especially High Dynamic Range (HDR) and Standard Dynamic Range (SDR) content, the demand for robust and generalizable Video Quality Assessment (VQA) methods has grown steadily. Existin…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: ICME 2025 Grand Challenges

  15. arXiv:2506.22467

    eess.SP cs.CV

    SegmentAnyMuscle: A universal muscle segmentation model across different locations in MRI

    Authors: Roy Colglazier, Jisoo Lee, Haoyu Dong, Hanxue Gu, Yaqian Chen, Joseph Cao, Zafer Yildiz, Zhonghao Liu, Nicholas Konz, Jichen Yang, Jikai Zhang, Yuwen Chen, Lin Li, Adrian Camarena, Maciej A. Mazurowski

    Abstract: The quantity and quality of muscles are increasingly recognized as important predictors of health outcomes. While MRI offers a valuable modality for such assessments, obtaining precise quantitative measurements of musculature remains challenging. This study aimed to develop a publicly available model for muscle segmentation in MRIs and demonstrate its applicability across various anatomical locati…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 24 pages, 6 figures

  16. arXiv:2506.19328

    eess.SY

    Peer-to-Peer Energy Markets With Uniform Pricing: A Dynamic Operating Envelope Approach

    Authors: Zeinab Salehi, Yijun Chen, Ian R. Petersen, Guodong Shi, Duncan S. Callaway, Elizabeth L. Ratnam

    Abstract: The recent widespread adoption of rooftop solar backed by battery storage is enabling energy customers to both produce and consume electricity (i.e., prosumers of electricity). To facilitate prosumer participation in the electric grid, new market mechanisms are required. In this paper, we design peer-to-peer energy markets where prosumers trade their excess energy with peers to gain profit while s…

    Submitted 24 June, 2025; originally announced June 2025.

  17. arXiv:2506.18324

    eess.SP

    ARSAR-Net: Intelligent SAR Imaging with Adaptive Regularization

    Authors: Shiping Fu, Yufan Chen, Zhe Zhang, Xiaolan Qiu, Qixiang Ye

    Abstract: Deep unfolding networks have recently emerged as a promising approach for synthetic aperture radar (SAR) imaging. However, baseline unfolding networks, typically derived from iterative reconstruction algorithms such as the alternating direction method of multipliers (ADMM), lack generalization capability across scenes, primarily because their regularizers are empirically designed rather than learn…

    Submitted 26 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  18. arXiv:2506.18094

    eess.SY

    G-SEED: A Spatio-temporal Encoding Framework for Forest and Grassland Data Based on GeoSOT

    Authors: Xuan Ouyang, Xinwen Yu, Yan Chen, Guang Deng, Xuanxin Liu

    Abstract: In recent years, the rapid development of remote sensing, Unmanned Aerial Vehicles, and IoT technologies has led to an explosive growth in spatio-temporal forest and grassland data, which are increasingly multimodal, heterogeneous, and subject to continuous updates. However, existing Geographic Information Systems (GIS)-based systems struggle to integrate and manage such large-scale and diverse…

    Submitted 22 June, 2025; originally announced June 2025.

    Comments: 11 pages, 2 figures. Previously submitted to a non-academic conference (ICGARSA 2025) and formally withdrawn

  19. arXiv:2506.16751

    eess.AS

    H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing

    Authors: Akanksha Singh, Yi-Ping Phoebe Chen, Vipul Arora

    Abstract: Query-by-example spoken term detection (QbE-STD) searches for matching words or phrases in an audio dataset using a sample spoken query. When annotated data is limited or unavailable, QbE-STD is often done using template matching methods like dynamic time warping (DTW), which are computationally expensive and do not scale well. To address this, we propose H-QuEST (Hierarchical Query-by-Example Spo…

    Submitted 20 June, 2025; originally announced June 2025.

    Journal ref: Interspeech 2025

  20. arXiv:2506.15882

    cs.LG cs.AI cs.CL eess.SP

    Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute

    Authors: Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen, Lei Xing, James Zou

    Abstract: Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different proble…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 18 pages, 5 figures, Project website: https://shengliu66.github.io/fractreason/

  21. arXiv:2506.15748

    eess.IV cs.CV

    Diffusion-based Counterfactual Augmentation: Towards Robust and Interpretable Knee Osteoarthritis Grading

    Authors: Zhe Wang, Yuhua Ru, Aladine Chetouani, Tina Shiang, Fang Chen, Fabian Bauer, Liping Zhang, Didier Hans, Rachid Jennane, William Ewing Palmer, Mohamed Jarraya, Yung Hsin Chen

    Abstract: Automated grading of Knee Osteoarthritis (KOA) from radiographs is challenged by significant inter-observer variability and the limited robustness of deep learning models, particularly near critical decision boundaries. To address these limitations, this paper proposes a novel framework, Diffusion-based Counterfactual Augmentation (DCA), which enhances model robustness and interpretability by gene…

    Submitted 18 June, 2025; originally announced June 2025.

  22. arXiv:2506.14201

    cs.RO eess.SY

    Pose State Perception of Interventional Robot for Cardio-cerebrovascular Procedures

    Authors: Shunhan Ji, Yanxi Chen, Zhongyu Yang, Quan Zhang, Xiaohang Nie, Jingqian Sun, Yichao Tang

    Abstract: In response to the increasing demand for cardiocerebrovascular interventional surgeries, precise control of interventional robots has become increasingly important. Within these complex vascular scenarios, the accurate and reliable perception of the pose state for interventional robots is particularly crucial. This paper presents a novel vision-based approach without the need for additional sensors…

    Submitted 17 June, 2025; originally announced June 2025.

  23. arXiv:2506.13415

    eess.IV cs.AI cs.CV

    Simple is what you need for efficient and accurate medical image segmentation

    Authors: Xiang Yu, Yayan Chen, Guannan He, Qing Zeng, Yue Qin, Meiling Liang, Dandan Luo, Yimei Liao, Zeyu Ren, Cheng Kang, Delong Yang, Bocheng Liang, Bin Pu, Ying Yuan, Shengli Li

    Abstract: While modern segmentation models often prioritize performance over practicality, we advocate a design philosophy that prioritizes simplicity and efficiency, and on that basis attempt to design a high-performance segmentation model. This paper presents SimpleUNet, a scalable ultra-lightweight medical image segmentation model with three key innovations: (1) A partial feature selection mechanism in skip connections for r…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 15 pages, 11 figures

    ACM Class: I.4.6

  24. arXiv:2506.13317

    cs.IT eess.SP

    A Contemporary Survey on Fluid Antenna Systems: Fundamentals and Networking Perspectives

    Authors: Hanjiang Hong, Kai-Kit Wong, Hao Xu, Xinghao Guo, Farshad Rostami Ghadi, Yu Chen, Yin Xu, Chan-Byoung Chae, Baiyang Liu, Kin-Fai Tong, Yangyang Zhang

    Abstract: The explosive growth of teletraffic, fueled by the convergence of cyber-physical systems and data-intensive applications, such as the Internet of Things (IoT), autonomous systems, and immersive communications, demands a multidisciplinary suite of innovative solutions across the physical and network layers. Fluid antenna systems (FAS) represent a transformative advancement in antenna design, offeri…

    Submitted 16 June, 2025; originally announced June 2025.

  25. arXiv:2506.12186

    eess.IV cs.AI cs.CV cs.LG

    MRI-CORE: A Foundation Model for Magnetic Resonance Imaging

    Authors: Haoyu Dong, Yuwen Chen, Hanxue Gu, Nicholas Konz, Yaqian Chen, Qihang Li, Maciej A. Mazurowski

    Abstract: The widespread use of Magnetic Resonance Imaging (MRI) and the rise of deep learning have enabled the development of powerful predictive models for a wide range of diagnostic tasks in MRI, such as image classification or object segmentation. However, training models for specific new tasks often requires large amounts of labeled data, which is difficult to obtain due to high annotation costs and da…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 19 pages, 5 figures

  26. arXiv:2506.10325

    eess.IV cs.CV cs.LG

    SWDL: Stratum-Wise Difference Learning with Deep Laplacian Pyramid for Semi-Supervised 3D Intracranial Hemorrhage Segmentation

    Authors: Cheng Wang, Siqi Chen, Donghua Mi, Yang Chen, Yudong Zhang, Yinsheng Li

    Abstract: Recent advances in medical imaging have established deep learning-based segmentation as the predominant approach, though it typically requires large amounts of manually annotated data. However, obtaining annotations for intracranial hemorrhage (ICH) remains particularly challenging due to the tedious and costly labeling process. Semi-supervised learning (SSL) has emerged as a promising solution to…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 11 pages, 4 figures, 6 tables

    MSC Class: 92C55; 68T45 ACM Class: I.4.6; I.4.9; I.2.6; J.3

  27. arXiv:2506.09999

    cs.LG cs.MM cs.SD eess.AS

    Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion

    Authors: Yukun Chen, Zihuan Qiu, Fanman Meng, Hongliang Li, Linfeng Xu, Qingbo Wu

    Abstract: Unlike traditional Multimodal Class-Incremental Learning (MCIL) methods that focus only on vision and text, this paper explores MCIL across vision, audio and text modalities, addressing challenges in integrating complementary information and mitigating catastrophic forgetting. To tackle these issues, we propose an MCIL method based on multimodal pre-trained models. Firstly, a Multimodal Incrementa…

    Submitted 6 February, 2025; originally announced June 2025.

  28. arXiv:2506.09650

    cs.CV cs.LG cs.MM cs.RO eess.IV

    HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

    Authors: Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen

    Abstract: Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person set…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: The code is available at https://github.com/KPeng9510/HopaDIFF.git

  29. arXiv:2506.09344

    cs.AI cs.CL cs.CV cs.LG cs.SD eess.AS

    Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, et al. (33 additional authors not shown)

    Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  30. arXiv:2506.07876

    cs.RO eess.SY

    Versatile Loco-Manipulation through Flexible Interlimb Coordination

    Authors: Xinghao Zhu, Yuxin Chen, Lingfeng Sun, Farzad Niroui, Simon Le Cleac'h, Jiuguang Wang, Kuan Fang

    Abstract: The ability to flexibly leverage limbs for loco-manipulation is essential for enabling autonomous robots to operate in unstructured environments. Yet, prior work on loco-manipulation is often constrained to specific tasks or predetermined limb configurations. In this work, we present Reinforcement Learning for Interlimb Coordination (ReLIC), an approach that enables versatile loco-manipulation thr…

    Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  31. arXiv:2506.05919

    eess.SY cs.IT

    RSMA-Enabled Covert Communications Against Multiple Spatially Random Wardens

    Authors: Xinyue Pei, Jihao Liu, Xuewen Luo, Xingwei Wang, Yingyang Chen, Miaowen Wen, Theodoros A. Tsiftsis

    Abstract: This work investigates covert communication in a rate-splitting multiple access (RSMA)-based multi-user multiple-input single-output system, where the random locations of the wardens follow a homogeneous Poisson point process. To demonstrate practical deployment scenarios, imperfect channel state information at the transmitter is considered. Closed-form expressions for the statistics of the receiv…

    Submitted 6 June, 2025; originally announced June 2025.

  32. arXiv:2506.03511

    astro-ph.EP astro-ph.IM cs.AI eess.IV

    POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning

    Authors: Fangyi Cao, Bin Ren, Zihao Wang, Shiwei Fu, Youbin Mo, Xiaoyang Liu, Yuzhou Chen, Weixin Yao

    Abstract: With over 1,000,000 images from more than 10,000 exposures using state-of-the-art high-contrast imagers (e.g., Gemini Planet Imager, VLT/SPHERE) in the search for exoplanets, can artificial intelligence (AI) serve as a transformative tool in imaging Earth-like exoplanets in the coming decade? In this paper, we introduce a benchmark and explore this question from a polarimetric image representation…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 9 pages main text with 5 figures, 9 pages appendix with 9 figures. Submitted to NeurIPS 2025

  33. arXiv:2506.03502

    cs.CV eess.SY

    CHIME: Conditional Hallucination and Integrated Multi-scale Enhancement for Time Series Diffusion Model

    Authors: Yuxuan Chen, Haipeng Xie

    Abstract: The denoising diffusion probabilistic model has become a mainstream generative model, achieving significant success in various computer vision tasks. Recently, there has been initial exploration of applying diffusion models to time series tasks. However, existing studies still face challenges in multi-scale feature alignment and generative capabilities across different entities and long-time scale…

    Submitted 4 July, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  34. arXiv:2506.02197

    eess.IV cs.CV

    NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution

    Authors: Marcos V. Conde, Radu Timofte, Zihao Lu, Xiangyu Kong, Xiaoxia Xing, Fan Wang, Suejin Han, MinKyu Park, Tianyu Zhang, Xin Luo, Yeda Chen, Dong Liu, Li Pang, Yuhang Yang, Hongzhong Wang, Xiangyong Cao, Ruixuan Jiang, Senyan Xu, Siyuan Jiang, Xueyang Fu, Zheng-Jun Zha, Tianyu Hao, Yuhong He, Ruoqi Li, Yueqi Yang, et al. (14 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Restoration and Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is less explored than in the RGB domain. The goal of this challenge is twofold: (i) restore RAW images with blur and…

    Submitted 4 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

  35. arXiv:2506.00466

    eess.AS cs.SD

    M3ANet: Multi-scale and Multi-Modal Alignment Network for Brain-Assisted Target Speaker Extraction

    Authors: Cunhang Fan, Ying Chen, Jian Zhou, Zexu Pan, Jingjing Zhang, Youdian Gao, Xiaoke Yang, Zhengqi Wen, Zhao Lv

    Abstract: Brain-assisted target speaker extraction (TSE) aims to extract the attended speech from mixed speech by utilizing brain neural activity, for example electroencephalography (EEG). However, existing models overlook the issue of temporal misalignment between speech and EEG modalities, which hampers TSE performance. In addition, the speech encoder in current models typically uses basic tempo…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted to IJCAI 2025

  36. arXiv:2506.00211

    eess.SP

    The Impact of Uniform Circular Array on Near-field ISAC

    Authors: Na Xue, Xidong Mu, Yue Chen, Yuanwei Liu

    Abstract: A novel uniform circular array (UCA) based near-field (NF) integrated sensing and communication (ISAC) framework is proposed, where a cylindrical coordinate system is invoked to evaluate the joint positioning performance. The joint squared position error bound (SPEB) of the sensing target (ST) is derived for the coplanar and non-coplanar cases. For the coplanar case, where the ST is located in the copl…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: Submitted to IEEE Trans, 13 pages, 9 figures

  37. arXiv:2506.00192

    eess.SP

    STARS-assisted Near-field ISAC: Sensor Deployment and Beamforming Design

    Authors: Na Xue, Xidong Mu, Yue Chen, Yuanwei Liu

    Abstract: A simultaneously transmitting and reflecting surface (STARS) assisted near-field (NF) integrated sensing and communication (ISAC) framework is proposed, where the radio sensors are installed on the STARS to directly conduct the distance-domain sensing by exploiting the characteristic spherical wavefront. A new squared position error bound (SPEB) expression is derived to reveal the dependence on be…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: Submitted to IEEE Trans, 15 pages, 7 figures

  38. arXiv:2505.24496

    eess.AS

    Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation

    Authors: Wenrui Liu, Qian Chen, Wen Wang, Yafeng Chen, Jin Xu, Zhifang Guo, Guanrou Yang, Weiqin Li, Xiaoda Yang, Tao Jin, Minghui Fang, Jialong Zuo, Bai Jionghao, Zemin Liu

    Abstract: Neural audio codecs, used as speech tokenizers, have demonstrated remarkable potential in the field of speech generation. However, to ensure high-fidelity audio reconstruction, neural audio codecs typically encode audio into long sequences of speech tokens, posing a significant challenge for downstream language models in long-context modeling. We observe that speech token sequences exhibit short-r…

    Submitted 30 May, 2025; originally announced May 2025.

  39. arXiv:2505.24224

    eess.AS

    MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition

    Authors: Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Yicong Jiang, Jiankun Zhao, Jiajun Deng, Guinan Li, Youjun Chen, Huimeng Wang, Haoning Xu, Mingyu Cui, Xunying Liu

    Abstract: This paper proposes a novel Mixture of Prompt-Experts based Speaker Adaptation approach (MOPSA) for elderly speech recognition. It allows zero-shot, real-time adaptation to unseen speakers, and leverages domain knowledge tailored to elderly speakers. Top-K most distinctive speaker prompt clusters derived using K-means serve as experts. A router network is trained to dynamically combine clustered p…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  40. arXiv:2505.23236

    cs.SD cs.HC eess.AS

    Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition

    Authors: Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu

    Abstract: This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. Fine-grained speech emotion descriptor (SED) features, e.g., pitch, tone and emphasis, are disentangled from HuBERT SSL representations via alternating LLM fine-tuning to joint SER-SED prediction and ASR tasks. VAE compressed HuBERT features derived via Information Bottleneck (IB) are used t…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by INTERSPEECH2025

  41. arXiv:2505.22685  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    DeepMultiConnectome: Deep Multi-Task Prediction of Structural Connectomes Directly from Diffusion MRI Tractography

    Authors: Marcus J. Vroemen, Yuqian Chen, Yui Lo, Tengfei Xue, Weidong Cai, Fan Zhang, Josien P. W. Pluim, Lauren J. O'Donnell

    Abstract: Diffusion MRI (dMRI) tractography enables in vivo mapping of brain structural connections, but traditional connectome generation is time-consuming and requires gray matter parcellation, posing challenges for large-scale studies. We introduce DeepMultiConnectome, a deep-learning model that predicts structural connectomes directly from tractography, bypassing the need for gray matter parcellation wh…

    Submitted 11 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: 15 pages, 5 figures

  42. arXiv:2505.22608  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates

    Authors: Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu

    Abstract: This paper presents a novel approach for speech foundation models compression that tightly integrates model pruning and parameter update into a single stage. Highly compact layer-level tied self-pinching gates each containing only a single learnable threshold are jointly trained with uncompressed models and used in fine-grained neuron level pruning. Experiments conducted on the LibriSpeech-100hr c…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Submitted to Interspeech 2025

  43. arXiv:2505.22399  [pdf, other]

    math.OC eess.SY

    Learning to Pursue AC Optimal Power Flow Solutions with Feasibility Guarantees

    Authors: Damola Ajeyemi, Yiting Chen, Antonin Colot, Jorge Cortes, Emiliano Dall'Anese

    Abstract: This paper focuses on an AC optimal power flow (OPF) problem for distribution feeders equipped with controllable distributed energy resources (DERs). We consider a solution method that is based on a continuous approximation of the projected gradient flow - referred to as the safe gradient flow - that incorporates voltage and current information obtained either through real-time measurements or pow…

    Submitted 28 May, 2025; originally announced May 2025.

  44. arXiv:2505.22106  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion

    Authors: Junqi Zhao, Jinzheng Zhao, Haohe Liu, Yun Chen, Lu Han, Xubo Liu, Mark Plumbley, Wenwu Wang

    Abstract: Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE) paths. However, this approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address…

    Submitted 28 May, 2025; originally announced May 2025.

  45. arXiv:2505.22045  [pdf, ps, other]

    cs.MM cs.CV cs.SD eess.AS

    Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning

    Authors: Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu

    Abstract: Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysi…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted by INTERSPEECH 2025

  46. arXiv:2505.21245  [pdf, ps, other]

    cs.SD eess.AS

    Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision

    Authors: Zhaoqing Li, Haoning Xu, Zengrui Jin, Lingwei Meng, Tianzi Wang, Huimeng Wang, Youjun Chen, Mingyu Cui, Shujie Hu, Xunying Liu

    Abstract: Model compression has become an emerging need as the sizes of modern speech systems rapidly increase. In this paper, we study model weight quantization, which directly reduces the memory footprint to accommodate computationally resource-constrained applications. We propose novel approaches to perform extremely low-bit (i.e., 2-bit and 1-bit) quantization of Conformer automatic speech recognition s…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech2025

  47. arXiv:2505.19931  [pdf, ps, other]

    eess.AS cs.SD

    Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling

    Authors: Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen

    Abstract: Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free appro…

    Submitted 4 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  48. arXiv:2505.19486  [pdf, ps, other]

    eess.SY cs.LG cs.MA

    VLMLight: Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning

    Authors: Maonan Wang, Yirong Chen, Aoyu Pang, Yuxin Cai, Chung Shue Chen, Yuheng Kan, Man-On Pun

    Abstract: Traffic signal control (TSC) is a core challenge in urban mobility, where real-time decisions must balance efficiency and safety. Existing methods - ranging from rule-based heuristics to reinforcement learning (RL) - often struggle to generalize to complex, dynamic, and safety-critical scenarios. We introduce VLMLight, a novel TSC framework that integrates vision-language meta-control with dual-br…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: 25 pages, 15 figures

  49. arXiv:2505.19446  [pdf, other]

    eess.AS

    Leveraging Cascaded Binary Classification and Multimodal Fusion for Dementia Detection through Spontaneous Speech

    Authors: Yin-Long Liu, Yuanchao Li, Rui Feng, Liu He, Jia-Xin Chen, Yi-Ming Wang, Yu-Ang Chen, Yan-Han Peng, Jia-Hong Yuan, Zhen-Hua Ling

    Abstract: This paper presents our submission to the PROCESS Challenge 2025, focusing on spontaneous speech analysis for early dementia detection. For the three-class classification task (Healthy Control, Mild Cognitive Impairment, and Dementia), we propose a cascaded binary classification framework that fine-tunes pre-trained language models and incorporates pause encoding to better capture disfluencies. Th…

    Submitted 26 May, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  50. arXiv:2505.19294  [pdf, other]

    cs.SD cs.CL cs.HC cs.MM eess.AS

    Towards Reliable Large Audio Language Model

    Authors: Ziyang Ma, Xiquan Li, Yakun Song, Wenxi Chen, Chenpeng Du, Jian Wu, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

    Abstract: Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and refuse to answer questions they don't know proactively. While there have been successful attempts to enhance…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: ACL 2025 Findings