Showing 1–50 of 844 results for author: Zhang, S

Searching in archive eess.

  1. arXiv:2507.07803  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

    Authors: Shoutao Guo, Xiang Li, Shaolei Zhang, Mengge Liu, Wei Chen, Yang Feng

    Abstract: Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require c…

    Submitted 10 July, 2025; originally announced July 2025.

    Comments: The code is at https://github.com/ictnlp/StreamUni; The model is at https://huggingface.co/ICTNLP/StreamUni-Phi4

  2. arXiv:2507.07526  [pdf, ps, other]

    cs.SD eess.AS

    DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction

    Authors: Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Minggang Zhao, Zhao Lv

    Abstract: Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and…

    Submitted 10 July, 2025; originally announced July 2025.

    Comments: Accepted by ACM MM 2025

  3. arXiv:2507.07396  [pdf, ps, other]

    cs.MM cs.LG cs.SD eess.AS

    IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing

    Authors: Zeyang Song, Shimin Zhang, Yuhong Chou, Jibin Wu, Haizhou Li

    Abstract: Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the…

    Submitted 9 July, 2025; originally announced July 2025.

    Comments: Under review of TNNLS

  4. arXiv:2507.07270  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    Audio-Visual Speech Separation via Bottleneck Iterative Network

    Authors: Sidong Zhang, Shiv Shankar, Trang Nguyen, Andrea Fanelli, Madalina Fiterau

    Abstract: Integration of information from non-auditory cues can significantly improve the performance of speech-separation models. Often such models use deep modality-specific networks to obtain unimodal features, and risk being too costly or lightweight but lacking capacity. In this work, we present an iterative representation refinement approach called Bottleneck Iterative Network (BIN), a technique that…

    Submitted 9 July, 2025; originally announced July 2025.

    Comments: Accepted to the 42nd International Conference on Machine Learning Workshop on Machine Learning for Audio

  5. arXiv:2507.05911  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Differentiable Reward Optimization for LLM based TTS system

    Authors: Changfeng Gao, Zhihao Du, Shiliang Zhang

    Abstract: This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly computes the rewards based on neural codec tokens, rather than relying on synthesized audio. Furth…

    Submitted 8 July, 2025; originally announced July 2025.

  6. arXiv:2507.00902  [pdf, ps, other]

    eess.SY cs.AI eess.SP

    Constellation as a Service: Tailored Connectivity Management in Direct-Satellite-to-Device Networks

    Authors: Feng Wang, Shengyu Zhang, Een-Kee Hong, Tony Q. S. Quek

    Abstract: Direct-satellite-to-device (DS2D) communication is emerging as a promising solution for global mobile service extension, leveraging the deployment of satellite constellations. However, the challenge of managing DS2D connectivity for multi-constellations becomes outstanding, including high interference and frequent handovers caused by multi-coverage overlap and rapid satellite movement. Moreover, e…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: To appear in IEEE Communications Magazine

  7. arXiv:2506.19384  [pdf, ps, other]

    cs.LG eess.SP physics.comp-ph

    Deep Electromagnetic Structure Design Under Limited Evaluation Budgets

    Authors: Shijian Zheng, Fangxiao Jin, Shuhai Zhang, Quan Xue, Mingkui Tan

    Abstract: Electromagnetic structure (EMS) design plays a critical role in developing advanced antennas and materials, but remains challenging due to high-dimensional design spaces and expensive evaluations. While existing methods commonly employ high-quality predictors or generators to alleviate evaluations, they are often data-intensive and struggle with real-world scale and budget constraints. To address…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: ICML 2025 (accepted)

  8. arXiv:2506.17983  [pdf, ps, other]

    eess.IV cs.CV

    LVPNet: A Latent-variable-based Prediction-driven End-to-end Framework for Lossless Compression of Medical Images

    Authors: Chenyue Song, Chen Hui, Qing Lin, Wei Zhang, Siqiao Li, Haiqi Zhu, Zhixuan Li, Shengping Zhang, Shaohui Liu, Feng Jiang, Xiang Li

    Abstract: Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of la…

    Submitted 25 June, 2025; v1 submitted 22 June, 2025; originally announced June 2025.

    Comments: Accepted to MICCAI 2025

  9. arXiv:2506.17164  [pdf, ps, other]

    cs.IT eess.SP

    Codeword-Segmentation Rate-Splitting Multiple Access and Evaluation under Suboptimal Decoding

    Authors: Sibo Zhang, Bruno Clerckx, David Vargas

    Abstract: Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique. We propose a novel architecture for downlink RSMA, namely Codeword-Segmentation RSMA (CS-RSMA). Different from conventional RSMA which splits users' messages into common and private parts before encoding, CS-RSMA encodes the users' messages directly, segments the codewords into common and private pa…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Submitted to IEEE for publication

  10. arXiv:2506.16803  [pdf, ps, other]

    eess.IV cs.CV

    Temperature calibration of surface emissivities with an improved thermal image enhancement network

    Authors: Ning Chu, Siya Zheng, Shanqing Zhang, Li Li, Caifang Cai, Ali Mohammad-Djafari, Feng Zhao, Yuanbo Song

    Abstract: Infrared thermography faces persistent challenges in temperature accuracy due to material emissivity variations, where existing methods often neglect the joint optimization of radiometric calibration and image degradation. This study introduces a physically guided neural framework that unifies temperature correction and image enhancement through a symmetric skip-CNN architecture and an emissivity-…

    Submitted 20 June, 2025; originally announced June 2025.

  11. arXiv:2506.15124  [pdf, ps, other]

    eess.SY

    A Force Feedback Exoskeleton for Teleoperation Using Magnetorheological Clutches

    Authors: Zhongyuan Kong, Lei Li, Erwin Ang Tien Yew, Zirui Chen, Wenbo Li, Shiwu Zhang, Jian Yang, Shuaishuai Sun

    Abstract: This paper proposes an upper-limb exoskeleton teleoperation system based on magnetorheological (MR) clutches, aiming to improve operational accuracy and enhance the immersive experience during lunar sampling tasks. Conventional exoskeleton teleoperation systems commonly employ active force feedback solutions, such as servo motors, which typically suffer from high system complexity and increased en…

    Submitted 17 June, 2025; originally announced June 2025.

  12. arXiv:2506.15082  [pdf, ps, other]

    eess.SY

    Make Your AUV Adaptive: An Environment-Aware Reinforcement Learning Framework For Underwater Tasks

    Authors: Yimian Ding, Jingzehua Xu, Guanwen Xie, Shuai Zhang, Yi Li

    Abstract: This study presents a novel environment-aware reinforcement learning (RL) framework designed to augment the operational capabilities of autonomous underwater vehicles (AUVs) in underwater environments. Departing from traditional RL architectures, the proposed framework integrates an environment-aware network module that dynamically captures flow field data, effectively embedding this critical envi…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: This paper has been accepted by IROS 2025

  13. arXiv:2506.13642  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.SD eess.AS

    Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

    Authors: Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng

    Abstract: The emergence of GPT-4o-like large multimodal models (LMMs) has raised the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representation of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for…

    Submitted 22 June, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

    Comments: Code: https://github.com/ictnlp/Stream-Omni , Model: https://huggingface.co/ICTNLP/stream-omni-8b

  14. arXiv:2506.12668  [pdf, ps, other]

    cs.IT eess.SP

    SIC-Free Rate-Splitting Multiple Access: Constellation-Constrained Optimization and Application to Large-Scale Systems

    Authors: Sibo Zhang, Bruno Clerckx, David Vargas

    Abstract: Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique for future wireless communication systems. Recent research demonstrates that RSMA can maintain its superiority without relying on Successive Interference Cancellation (SIC) receivers. In practical systems, SIC-free receivers are more attractive than SIC receivers because of their low complexity and l…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: Submitted to IEEE for publication

  15. arXiv:2506.12646  [pdf, ps, other]

    cs.IT eess.SP

    Optimal and Suboptimal Decoders under Finite-Alphabet Interference: A Mismatched Decoding Perspective

    Authors: Sibo Zhang, Bruno Clerckx

    Abstract: Interference widely exists in communication systems and is often not optimally treated at the receivers due to limited knowledge and/or computational burden. Evolutions of receivers have been proposed to balance complexity and spectral efficiency, for example, for 6G, while commonly used performance metrics, such as capacity and mutual information, fail to capture the suboptimal treatment of inter…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: Submitted to IEEE for publication

  16. arXiv:2506.12495  [pdf, ps, other]

    cs.AI eess.SY

    Automated Heuristic Design for Unit Commitment Using Large Language Models

    Authors: Junjin Lv, Chenggang Cui, Shaodi Zhang, Hui Chen, Chunyang Gong, Jiaming Liu

    Abstract: The Unit Commitment (UC) problem is a classic challenge in the optimal scheduling of power systems. Years of research and practice have shown that formulating reasonable unit commitment plans can significantly improve the economic efficiency of power systems' operations. In recent years, with the introduction of technologies such as machine learning and the Lagrangian relaxation method, the soluti…

    Submitted 14 June, 2025; originally announced June 2025.

  17. arXiv:2506.10653  [pdf, ps, other]

    eess.AS cs.CL cs.LG

    Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

    Authors: Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya

    Abstract: Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation…

    Submitted 12 June, 2025; originally announced June 2025.

    Journal ref: Interspeech 2025

  18. arXiv:2506.09377  [pdf, ps, other]

    eess.IV

    An Interpretable Two-Stage Feature Decomposition Method for Deep Learning-based SAR ATR

    Authors: Chenwei Wang, Renjie Xu, Congwen Wu, Cunyi Yin, Ziyun Liao, Deqing Mao, Sitong Zhang, Hong Yan

    Abstract: Synthetic aperture radar automatic target recognition (SAR ATR) has seen significant performance improvements with deep learning. However, the black-box nature of deep SAR ATR introduces low confidence and high risks in decision-critical SAR applications, hindering practical deployment. To address this issue, deep SAR ATR should provide an interpretable reasoning basis $r_b$ and logic $λ_w$, formi…

    Submitted 11 June, 2025; originally announced June 2025.

  19. arXiv:2506.08366  [pdf, ps, other]

    eess.SY

    Learning event-triggered controllers for linear parameter-varying systems from data

    Authors: Renjie Ma, Su Zhang, Wenjie Liu, Zhijian Hu, Peng Shi

    Abstract: Nonlinear dynamical behaviours in engineering applications can be approximated by linear-parameter varying (LPV) representations, but obtaining precise model knowledge to develop a control algorithm is difficult in practice. In this paper, we develop the data-driven control strategies for event-triggered LPV systems with stability verifications. First, we provide the theoretical analysis of $θ$-pe…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: 13 pages, 5 figures

  20. arXiv:2506.07869  [pdf, ps, other]

    cs.IT eess.SP

    Hybrid Beamforming Optimization for MIMO ISAC Exploiting Prior Information: A PCRB-based Approach

    Authors: Yizhuo Wang, Shuowen Zhang

    Abstract: This paper considers a multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system, where a multi-antenna base station (BS) with transceiver hybrid analog-digital arrays transmits dual-functional signals to communicate with a multi-antenna user and simultaneously sense the unknown and random location information of a target based on the reflected echo signals and the p…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: submitted for possible journal publication

  21. arXiv:2506.06012  [pdf, ps, other]

    cs.RO eess.SY math.OC

    Enhanced Trust Region Sequential Convex Optimization for Multi-Drone Thermal Screening Trajectory Planning in Urban Environments

    Authors: Kaiyuan Chen, Zhengjie Hu, Shaolin Zhang, Yuanqing Xia, Wannian Liang, Shuo Wang

    Abstract: The rapid detection of abnormal body temperatures in urban populations is essential for managing public health risks, especially during outbreaks of infectious diseases. Multi-drone thermal screening systems offer promising solutions for fast, large-scale, and non-intrusive human temperature monitoring. However, trajectory planning for multiple drones in complex urban environments poses significan…

    Submitted 19 June, 2025; v1 submitted 6 June, 2025; originally announced June 2025.

  22. arXiv:2506.03134  [pdf, ps, other]

    eess.SP cs.CV

    Simulate Any Radar: Attribute-Controllable Radar Simulation via Waveform Parameter Embedding

    Authors: Weiqing Xiao, Hao Huang, Chonghao Zhong, Yujie Lin, Nan Wang, Xiaoxue Chen, Zhaoxi Chen, Saining Zhang, Shuocheng Yang, Pierre Merriaux, Lei Lei, Hao Zhao

    Abstract: We present SA-Radar (Simulate Any Radar), a radar simulation approach that enables controllable and efficient generation of radar cubes conditioned on customizable radar attributes. Unlike prior generative or physics-based simulators, SA-Radar integrates both paradigms through a waveform-parameterized attribute embedding. We design ICFAR-Net, a 3D U-Net conditioned on radar attributes encoded via…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Code: https://github.com/zhuxing0/SA-Radar Project page: https://zhuxing0.github.io/projects/SA-Radar

  23. arXiv:2506.02604  [pdf]

    cs.CV eess.IV

    Application of convolutional neural networks in image super-resolution

    Authors: Chunwei Tian, Mingjian Song, Wangmeng Zuo, Bo Du, Yanning Zhang, Shichao Zhang

    Abstract: Due to the strong learning abilities of convolutional neural networks (CNNs), they have become mainstream methods for image super-resolution. However, deep learning methods of different types differ considerably. There is little literature that summarizes the relations and differences among methods in image super-resolution. Thus, summarizing this literature is important, accor…

    Submitted 6 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

    Comments: It has been accepted by CAAI transactions on intelligent systems, in Chinese language

  24. arXiv:2506.01947  [pdf, ps, other]

    eess.IV cs.CV

    RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report

    Authors: Marcos V. Conde, Radu Timofte, Radu Berdan, Beril Besbinar, Daisuke Iso, Pengzhou Ji, Xiong Dun, Zeying Fan, Chen Wu, Zhansheng Wang, Pengbo Zhang, Jiazi Huang, Qinglin Liu, Wei Yu, Shengping Zhang, Xiangyang Ji, Kyungsik Kim, Minkyung Kim, Hwalmin Lee, Hekun Ma, Huan Zheng, Yanyan Wei, Zhao Zhang, Jing Fang, Meilin Gao, et al. (8 additional authors not shown)

    Abstract: Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

  25. arXiv:2505.23036  [pdf, ps, other]

    cs.SD eess.AS

    AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition

    Authors: Yuhang Dai, He Wang, Xingchen Li, Zihan Zhang, Shuiyuan Wang, Lei Xie, Xin Xu, Hongxiao Guo, Shaoji Zhang, Hui Bu, Wei Chen

    Abstract: This paper delineates AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHELL-5 includes two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios. This audio data consists of four far-field speech signals captured by microphones located on each car do…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 5 pages, 1 figure, 3 tables, accepted by InterSpeech 2025

  26. arXiv:2505.22251  [pdf, ps, other]

    eess.AS cs.CL

    Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition

    Authors: Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

    Abstract: Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of fin…

    Submitted 5 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

  27. arXiv:2505.21578  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use

    Authors: Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier van Dalen

    Abstract: Automatic speech recognition (ASR) research is driven by the availability of common datasets between industrial researchers and academics, encouraging comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, Gigaspeech, OWSM…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  28. arXiv:2505.20956  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    Hybrid Disagreement-Diversity Active Learning for Bioacoustic Sound Event Detection

    Authors: Shiqi Zhang, Tuomas Virtanen

    Abstract: Bioacoustic sound event detection (BioSED) is crucial for biodiversity conservation but faces practical challenges during model development and training: limited amounts of annotated data, sparse events, species diversity, and class imbalance. To address these challenges efficiently with a limited labeling budget, we apply the mismatch-first farthest-traversal (MFFT), an active learning method int…

    Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: 5 pages, 1 figure, accepted by EUSIPCO 2025 v2: add our github repo

  29. arXiv:2505.17589  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    Authors: Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye

    Abstract: In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-…

    Submitted 27 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: Preprint, work in progress

  30. arXiv:2505.16236  [pdf, ps, other]

    cs.IT eess.SP

    Base Station Placement Optimization for Networked Sensing Exploiting Target Location Distribution

    Authors: Kaiyue Hou, Shuowen Zhang

    Abstract: This paper studies a networked sensing system with multiple base stations (BSs), which collaboratively sense the unknown and random three-dimensional (3D) location of a target based on the target-reflected echo signals received at the BSs. Considering a practical scenario where the target location distribution is known a priori for exploitation, we aim to design the placement of the multiple BSs t…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Longer version of a paper submitted for possible publication

  31. arXiv:2505.16230  [pdf, ps, other]

    cs.IT eess.SP

    Beyond Diagonal Intelligent Reflecting Surface Aided Integrated Sensing and Communication

    Authors: Shuo Zheng, Shuowen Zhang

    Abstract: Beyond diagonal intelligent reflecting surface (BD-IRS) is a new promising IRS architecture for which the reflection matrix is not limited to the diagonal structure as for conventional IRS. In this paper, we study a BD-IRS aided uplink integrated sensing and communication (ISAC) system where sensing is performed in a device-based manner. Specifically, we aim to estimate the unknown and random loca…

    Submitted 24 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: Accepted to appear in IEEE Transactions on Cognitive Communications and Networking, special issue on smart environment engineering for integrated sensing and communication

  32. arXiv:2505.16211  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

    Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, et al. (6 additional authors not shown)

    Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safet…

    Submitted 1 July, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: Technical Report

  33. arXiv:2505.15279  [pdf, ps, other]

    eess.SP

    Robust Secure Communications in Near-Field ISCAP Systems with Extremely Large-Scale Antenna Array

    Authors: Zixiang Ren, Siyao Zhang, Ling Qiu, Derrick Wing Kwan Ng, Jie Xu

    Abstract: This paper investigates robust secure communications in a near-field integrated sensing, communication, and powering (ISCAP) system, in which the base station (BS) is equipped with an extremely large-scale antenna array (ELAA). In this system, the BS transmits confidential messages to a single legitimate communication user (CU), simultaneously providing wireless power transfer to multiple energy r…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 13 pages

  34. arXiv:2505.13762   

    cs.RO eess.SY

    From Structural Design to Dynamics Modeling: Control-Oriented Development of a 3-RRR Parallel Ankle Rehabilitation Robot

    Authors: Siyuan Zhang, Yufei Zhang, Junlin Lyu, Sunil K. Agrawal

    Abstract: This paper presents the development of a wearable ankle rehabilitation robot based on a 3-RRR spherical parallel mechanism (SPM) to support multi-DOF recovery through pitch, roll, and yaw motions. The system features a compact, ergonomic structure designed for comfort, safety, and compatibility with ankle biomechanics. A complete design-to-dynamics pipeline has been implemented, including structur…

    Submitted 30 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: This paper was originally submitted as a class project and included the name of a faculty member without prior permission. At the instructor's request, I am withdrawing the paper. The work may be resubmitted in the future after further development and testing

  35. arXiv:2505.12887   

    eess.IV cs.CV

    RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions

    Authors: Junzhi Ning, Cheng Tang, Kaijin Zhou, Diping Song, Lihao Liu, Ming Hu, Wei Li, Yanzhou Su, Tianbing Li, Jiyao Liu, Yejin, Sheng Zhang, Yuanfeng Ji, Junjun He

    Abstract: The scarcity of high-quality, labelled retinal imaging data, which presents a significant challenge in the development of machine learning models for ophthalmology, hinders progress in the field. To synthesise Colour Fundus Photographs (CFPs), existing methods primarily relying on predefined disease labels face significant limitations. However, current methods remain limited, thus failing to gener…

    Submitted 24 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: The paper is being withdrawn due to issues with the authorship list. Specifically, one or more contributors were unintentionally omitted in the initial submission

  36. arXiv:2505.11200  [pdf, ps, other]

    cs.SD cs.AI cs.CL cs.HC cs.LG eess.AS

    Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese

    Authors: Xihuai Wang, Ziyi Zhao, Siyu Ren, Shao Zhang, Song Li, Xiaoyu Li, Ziwen Wang, Lin Qiu, Guanglu Wan, Xuezhi Cao, Xunliang Cai, Weinan Zhang

    Abstract: Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS Systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS System evaluation, it suffers from subjectivity, environmental inconsistencies, and limited…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: Under Review

  37. arXiv:2505.10834  [pdf, ps, other]

    cs.AI cs.LG eess.IV eess.SP

    TACO: Rethinking Semantic Communications with Task Adaptation and Context Embedding

    Authors: Achintha Wijesinghe, Weiwei Wang, Suchinthaka Wanninayaka, Songyang Zhang, Zhi Ding

    Abstract: Recent advancements in generative artificial intelligence have introduced groundbreaking approaches to innovating next-generation semantic communication, which prioritizes conveying the meaning of a message rather than merely transmitting raw data. A fundamental challenge in semantic communication lies in accurately identifying and extracting the most critical semantic information while adapting t…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: Submitted to the IEEE GlobeCom 2025

  38. arXiv:2505.10134  [pdf, other]

    eess.SP cs.AI cs.LG

    Large Wireless Localization Model (LWLM): A Foundation Model for Positioning in 6G Networks

    Authors: Guangjin Pan, Kaixuan Huang, Hui Chen, Shunqing Zhang, Christian Häger, Henk Wymeersch

    Abstract: Accurate and robust localization is a critical enabler for emerging 5G and 6G applications, including autonomous driving, extended reality (XR), and smart manufacturing. While data-driven approaches have shown promise, most existing models require large amounts of labeled data and struggle to generalize across deployment scenarios and wireless configurations. To address these limitations, we propo…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: 13 pages, 16 figures. This work has been submitted to the IEEE for possible publication

  39. arXiv:2505.04068  [pdf, other]

    cs.NI eess.SP

    Shadow Wireless Intelligence: Large Language Model-Driven Reasoning in Covert Communications

    Authors: Yuanai Xie, Zhaozhi Liu, Xiao Zhang, Shihua Zhang, Rui Hou, Minrui Xu, Ruichen Zhang, Dusit Niyato

    Abstract: Covert Communications (CC) can secure sensitive transmissions in industrial, military, and mission-critical applications within 6G wireless networks. However, traditional optimization methods based on Artificial Noise (AN), power control, and channel manipulation might not adapt to dynamic and adversarial environments due to the high dimensionality, nonlinearity, and stringent real-time covertness…

    Submitted 6 May, 2025; originally announced May 2025.

  40. arXiv:2505.02625  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

    Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng

    Abstract: Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achievin…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: Preprint. Project: https://github.com/ictnlp/LLaMA-Omni2

  41. arXiv:2505.00578  [pdf, other]

    eess.IV q-bio.QM

    AI-Driven Segmentation and Analysis of Microbial Cells

    Authors: Shuang Zhang, Carleton Coffin, Karyn L. Rogers, Catherine Ann Royer, Ge Wang

    Abstract: Studying the growth and metabolism of microbes provides critical insights into their evolutionary adaptations to harsh environments, which are essential for microbial research and biotechnology applications. In this study, we developed an AI-driven image analysis system to efficiently segment individual cells and quantitatively analyze key cellular features. This system is comprised of four main m…

    Submitted 5 May, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

  42. arXiv:2504.16037  [pdf, other]

    cs.RO eess.SY

    Adaptive Fault-tolerant Control of Underwater Vehicles with Thruster Failures

    Authors: Haolin Liu, Shiliang Zhang, Shangbin Jiao, Xiaohui Zhang, Xuehui Ma, Yan Yan, Wenchuan Cui, Youmin Zhang

    Abstract: This paper presents a fault-tolerant control for the trajectory tracking of autonomous underwater vehicles (AUVs) against thruster failures. We formulate faults in AUV thrusters as discrete switching events during an AUV mission, and develop a soft-switching approach to facilitate the shift of control strategies across fault scenarios. We mathematically define AUV thruster fault scenarios, and develo…

    Submitted 22 April, 2025; originally announced April 2025.

  43. arXiv:2504.15520  [pdf, other]

    eess.SP

    Element-Grouping Strategy for Intelligent Reflecting Surface: Performance Analysis and Algorithm Optimization

    Authors: Shengsheng Zhang, Taotao Ji, Meng Hua, Yongming Huang, Luxi Yang

    Abstract: As a revolutionary paradigm for intelligently controlling wireless channels, intelligent reflecting surface (IRS) has emerged as a promising technology for future sixth-generation (6G) wireless communications. While IRS-aided communication systems can achieve attractive high channel gains, existing schemes require plenty of IRS elements to mitigate the "multiplicative fading" effect in cascaded…

    Submitted 21 April, 2025; originally announced April 2025.

  44. arXiv:2504.14906  [pdf, ps, other]

    eess.AS cs.CV cs.SD

    OmniAudio: Generating Spatial Audio from 360-Degree Video

    Authors: Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue

    Abstract: Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard for…

    Submitted 2 June, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: ICML 2025

  45. arXiv:2504.14894  [pdf, other]

    cs.RO eess.SY

    Never too Cocky to Cooperate: An FIM and RL-based USV-AUV Collaborative System for Underwater Tasks in Extreme Sea Conditions

    Authors: Jingzehua Xu, Guanwen Xie, Jiwei Tang, Yimian Ding, Weiyi Liu, Shuai Zhang, Yi Li

    Abstract: This paper develops a novel unmanned surface vehicle (USV)-autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high-precision multi-AUV localization enabled by Fisher information matrix-optimized USV path planning, and (2) reinforcement learning-based cooperative planning and…

    Submitted 21 April, 2025; originally announced April 2025.

  46. arXiv:2504.12867  [pdf, other]

    eess.AS cs.AI cs.CL

    EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

    Authors: Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen

    Abstract: Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLM…

    Submitted 21 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

  47. arXiv:2504.12711  [pdf, other]

    cs.CV cs.AI eess.IV

    NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou, et al. (112 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ…

    Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

  48. arXiv:2504.11162  [pdf, ps, other]

    eess.SP cs.IT

    Scalable Transceiver Design for Multi-User Communication in FDD Massive MIMO Systems via Deep Learning

    Authors: Lin Zhu, Weifeng Zhu, Shuowen Zhang, Shuguang Cui, Liang Liu

    Abstract: This paper addresses the joint transceiver design, including pilot transmission, channel feature extraction and feedback, as well as precoding, for low-overhead downlink massive multiple-input multiple-output (MIMO) communication in frequency-division duplex (FDD) systems. Although deep learning (DL) has shown great potential in tackling this problem, existing methods often suffer from poor scalab…

    Submitted 15 April, 2025; originally announced April 2025.

  49. arXiv:2504.10911  [pdf, ps, other]

    eess.SP cs.IT

    Low-Overhead Channel Estimation Framework for Beyond Diagonal Reconfigurable Intelligent Surface Assisted Multi-User MIMO Communication

    Authors: Rui Wang, Shuowen Zhang, Bruno Clerckx, Liang Liu

    Abstract: Beyond diagonal reconfigurable intelligent surface (BD-RIS) refers to a family of RIS architectures characterized by scattering matrices not limited to being diagonal and enables higher wave manipulation flexibility and large performance gains over conventional (diagonal) RIS. To achieve those promising gains, accurate channel state information (CSI) needs to be acquired in BD-RIS assisted communi…

    Submitted 15 April, 2025; originally announced April 2025.

  50. arXiv:2504.10060  [pdf, other]

    eess.SP

    Learning to Beamform for Cooperative Localization and Communication: A Link Heterogeneous GNN-Based Approach

    Authors: Lixiang Lian, Chuanqi Bai, Yihan Xu, Huanyu Dong, Rui Cheng, Shunqing Zhang

    Abstract: Integrated sensing and communication (ISAC) has emerged as a key enabler for next-generation wireless networks, supporting advanced applications such as high-precision localization and environment reconstruction. Cooperative ISAC (CoISAC) further enhances these capabilities by enabling multiple base stations (BSs) to jointly optimize communication and sensing performance through coordination. Howe…

    Submitted 14 April, 2025; originally announced April 2025.