Search | arXiv e-print repository

StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

Authors: Shoutao Guo, Xiang Li, Shaolei Zhang, Mengge Liu, Wei Chen, Yang Feng

Abstract: Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require c… ▽ More Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks. △ Less

Submitted 10 July, 2025; originally announced July 2025.

Comments: The code is at https://github.com/ictnlp/StreamUni; The model is at https://huggingface.co/ICTNLP/StreamUni-Phi4

arXiv:2507.07526 [pdf, ps, other]

DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction

Authors: Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Minggang Zhao, Zhao Lv

Abstract: Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and… ▽ More Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework,achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline).Code is available at: https://github.com/fchest/DMF2Mel. △ Less

Submitted 10 July, 2025; originally announced July 2025.

Comments: Accepted by ACM MM 2025

arXiv:2507.07396 [pdf, ps, other]

IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing

Authors: Zeyang Song, Shimin Zhang, Yuhong Chou, Jibin Wu, Haizhou Li

Abstract: Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing task. Two key challenges hinder progress: (1) the… ▽ More Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing task. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome the issues, we introduce Input-aware Multi-Level Spikeformer, i.e. IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulate multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a Reparameterized Spiking Self-Attention (RepSSA) module with a Hierarchical Decay Mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0\% on AiShell-1 and 3.4\% on Librispeech-960, comparable to conventional ANN transformers while reducing theoretical inference energy consumption by 4.64$\times$ and 4.32$\times$ respectively. IML-Spikeformer marks an advance of scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency. △ Less

Submitted 9 July, 2025; originally announced July 2025.

Comments: Under review of TNNLS

arXiv:2507.07270 [pdf, ps, other]

Audio-Visual Speech Separation via Bottleneck Iterative Network

Authors: Sidong Zhang, Shiv Shankar, Trang Nguyen, Andrea Fanelli, Madalina Fiterau

Abstract: Integration of information from non-auditory cues can significantly improve the performance of speech-separation models. Often such models use deep modality-specific networks to obtain unimodal features, and risk being too costly or lightweight but lacking capacity. In this work, we present an iterative representation refinement approach called Bottleneck Iterative Network (BIN), a technique that… ▽ More Integration of information from non-auditory cues can significantly improve the performance of speech-separation models. Often such models use deep modality-specific networks to obtain unimodal features, and risk being too costly or lightweight but lacking capacity. In this work, we present an iterative representation refinement approach called Bottleneck Iterative Network (BIN), a technique that repeatedly progresses through a lightweight fusion block, while bottlenecking fusion representations by fusion tokens. This helps improve the capacity of the model, while avoiding major increase in model size and balancing between the model performance and training cost. We test BIN on challenging noisy audio-visual speech separation tasks, and show that our approach consistently outperforms state-of-the-art benchmark models with respect to SI-SDRi on NTCD-TIMIT and LRS3+WHAM! datasets, while simultaneously achieving a reduction of more than 50% in training and GPU inference time across nearly all settings. △ Less

Submitted 9 July, 2025; originally announced July 2025.

Comments: Accepted to the 42nd International Conference on Machine Learning Workshop on Machine Learning for Audio

arXiv:2507.05911 [pdf, ps, other]

Differentiable Reward Optimization for LLM based TTS system

Authors: Changfeng Gao, Zhihao Du, Shiliang Zhang

Abstract: This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly compute the rewards based on neural codec tokens, rather than relying on synthesized audio. Furth… ▽ More This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly compute the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereby streamlining the RLHF training process. Additionally, we introduce a multi-task reward (MTR) model which can provide feedback from different perspectives and find that it can augment the system's capability to follow instructions effectively.Experimental results indicate that DiffRO significantly improves the pronunciation accuracy of the TTS system, achieving state-of-the-art (SOTA) WER results on the seed-tts-eval benchmark. Moreover, with the integration of the MTR model, we demonstrate the ability to control emotional and quality attributes in a zero-shot manner. △ Less

Submitted 8 July, 2025; originally announced July 2025.

arXiv:2507.00902 [pdf, ps, other]

Constellation as a Service: Tailored Connectivity Management in Direct-Satellite-to-Device Networks

Authors: Feng Wang, Shengyu Zhang, Een-Kee Hong, Tony Q. S. Quek

Abstract: Direct-satellite-to-device (DS2D) communication is emerging as a promising solution for global mobile service extension, leveraging the deployment of satellite constellations. However, the challenge of managing DS2D connectivity for multi-constellations becomes outstanding, including high interference and frequent handovers caused by multi-coverage overlap and rapid satellite movement. Moreover, e… ▽ More Direct-satellite-to-device (DS2D) communication is emerging as a promising solution for global mobile service extension, leveraging the deployment of satellite constellations. However, the challenge of managing DS2D connectivity for multi-constellations becomes outstanding, including high interference and frequent handovers caused by multi-coverage overlap and rapid satellite movement. Moreover, existing approaches primarily operate within single-constellation shell, which inherently limits the ability to exploit the vast potential of multi-constellation connectivity provision, resulting in suboptimal DS2D service performances. To address these challenges, this article proposes a Constellation as a Service (CaaS) framework, which treats the entire multi-constellation infrastructure as a shared resource pool and dynamically forms optimal sub-constellations (SCs) for each DS2D service region. The formation of each SC integrates satellites from various orbits to provide tailored connectivity based on user demands, guided by two innovative strategies: predictive satellite beamforming using generative artificial intelligence (GenAI) and pre-configured handover path for efficient satellite access and mobility management. Simulation results demonstrate that CaaS significantly improves satellite service rates while reducing handover overhead, making it an efficient and continuable solution for managing DS2D connectivity in multi-constellation environments. △ Less

Submitted 1 July, 2025; originally announced July 2025.

Comments: To appear in IEEE Communications Magazine

arXiv:2506.19384 [pdf, ps, other]

Deep Electromagnetic Structure Design Under Limited Evaluation Budgets

Authors: Shijian Zheng, Fangxiao Jin, Shuhai Zhang, Quan Xue, Mingkui Tan

Abstract: Electromagnetic structure (EMS) design plays a critical role in developing advanced antennas and materials, but remains challenging due to high-dimensional design spaces and expensive evaluations. While existing methods commonly employ high-quality predictors or generators to alleviate evaluations, they are often data-intensive and struggle with real-world scale and budget constraints. To address… ▽ More Electromagnetic structure (EMS) design plays a critical role in developing advanced antennas and materials, but remains challenging due to high-dimensional design spaces and expensive evaluations. While existing methods commonly employ high-quality predictors or generators to alleviate evaluations, they are often data-intensive and struggle with real-world scale and budget constraints. To address this, we propose a novel method called Progressive Quadtree-based Search (PQS). Rather than exhaustively exploring the high-dimensional space, PQS converts the conventional image-like layout into a quadtree-based hierarchical representation, enabling a progressive search from global patterns to local details. Furthermore, to lessen reliance on highly accurate predictors, we introduce a consistency-driven sample selection mechanism. This mechanism quantifies the reliability of predictions, balancing exploitation and exploration when selecting candidate designs. We evaluate PQS on two real-world engineering tasks, i.e., Dual-layer Frequency Selective Surface and High-gain Antenna. Experimental results show that our method can achieve satisfactory designs under limited computational budgets, outperforming baseline methods. In particular, compared to generative approaches, it cuts evaluation costs by 75-85%, effectively saving 20.27-38.80 days of product designing cycle. △ Less

Submitted 24 June, 2025; originally announced June 2025.

Comments: ICML 2025 (accepted)

arXiv:2506.17983 [pdf, ps, other]

LVPNet: A Latent-variable-based Prediction-driven End-to-end Framework for Lossless Compression of Medical Images

Authors: Chenyue Song, Chen Hui, Qing Lin, Wei Zhang, Siqiao Li, Haiqi Zhu, Zhixuan Li, Shengping Zhang, Shaohui Liu, Feng Jiang, Xiang Li

Abstract: Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of la… ▽ More Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of latent variables. To deal with these issues, we propose a prediction-based end-to-end lossless medical image compression method named LVPNet, leveraging global latent variables to predict pixel values and encoding predicted probabilities for lossless compression. Specifically, we introduce the Global Multi-scale Sensing Module (GMSM), which extracts compact and informative latent representations from the entire image, effectively capturing spatial dependencies within the latent space. Furthermore, to mitigate the information loss introduced during quantization, we propose the Quantization Compensation Module (QCM), which learns the distribution of quantization errors and refines the quantized features to compensate for quantization loss. Extensive experiments on challenging benchmarks demonstrate that our method achieves superior compression efficiency compared to state-of-the-art lossless image compression approaches, while maintaining competitive inference speed. The code is at https://github.com/scy-Jackel/LVPNet. △ Less

Submitted 25 June, 2025; v1 submitted 22 June, 2025; originally announced June 2025.

Comments: Accepted to MICCAI 2025

arXiv:2506.17164 [pdf, ps, other]

Codeword-Segmentation Rate-Splitting Multiple Access and Evaluation under Suboptimal Decoding

Authors: Sibo Zhang, Bruno Clerckx, David Vargas

Abstract: Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique. We propose a novel architecture for downlink RSMA, namely Codeword-Segmentation RSMA (CS-RSMA). Different from conventional RSMA which splits users' messages into common and private parts before encoding, CS-RSMA encodes the users' messages directly, segments the codewords into common and private pa… ▽ More Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique. We propose a novel architecture for downlink RSMA, namely Codeword-Segmentation RSMA (CS-RSMA). Different from conventional RSMA which splits users' messages into common and private parts before encoding, CS-RSMA encodes the users' messages directly, segments the codewords into common and private parts, and transmits the codeword segments using common and private streams. In addition to the principle of CS-RSMA, a novel performance analysis framework is proposed. This framework utilizes a recent discovery in mismatched decoding under finite-alphabet input and interference, and can better capture the receiver's complexity limits. Precoder optimization under finite alphabets and suboptimal decoders for conventional RSMA and CS-RSMA to maximize the Sum-Rate (SR) and the Max-Min Fairness (MMF) is also addressed. The numerical results reveal the theoretical performance of conventional RSMA and CS-RSMA. We observe that CS-RSMA leads to better performance than conventional RSMA in SR, and similar performance in MMF. Furthermore, a physical-layer implementation of CS-RSMA is proposed and evaluated through link-level simulations. Aside performance benefits, we also demonstrate that CS-RSMA brings significant benefits on the encoding/decoding, control signaling, and retransmission process compared to conventional RSMA. △ Less

Submitted 20 June, 2025; originally announced June 2025.

Comments: Submitted to IEEE for publication

arXiv:2506.16803 [pdf, ps, other]

Temperature calibration of surface emissivities with an improved thermal image enhancement network

Authors: Ning Chu, Siya Zheng, Shanqing Zhang, Li Li, Caifang Cai, Ali Mohammad-Djafari, Feng Zhao, Yuanbo Song

Abstract: Infrared thermography faces persistent challenges in temperature accuracy due to material emissivity variations, where existing methods often neglect the joint optimization of radiometric calibration and image degradation. This study introduces a physically guided neural framework that unifies temperature correction and image enhancement through a symmetric skip-CNN architecture and an emissivity-… ▽ More Infrared thermography faces persistent challenges in temperature accuracy due to material emissivity variations, where existing methods often neglect the joint optimization of radiometric calibration and image degradation. This study introduces a physically guided neural framework that unifies temperature correction and image enhancement through a symmetric skip-CNN architecture and an emissivity-aware attention module. The pre-processing stage segments the ROIs of the image and and initially corrected the firing rate. A novel dual-constrained loss function strengthens the statistical consistency between the target and reference regions through mean-variance alignment and histogram matching based on Kullback-Leibler dispersion. The method works by dynamically fusing thermal radiation features and spatial context, and the model suppresses emissivity artifacts while recovering structural details. After validating the industrial blower system under different conditions, the improved network realizes the dynamic fusion of thermal radiation characteristics and spatial background, with accurate calibration results in various industrial conditions. △ Less

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.15124 [pdf, ps, other]

A Force Feedback Exoskeleton for Teleoperation Using Magnetorheological Clutches

Authors: Zhongyuan Kong, Lei Li, Erwin Ang Tien Yew, Zirui Chen, Wenbo Li, Shiwu Zhang, Jian Yang, Shuaishuai Sun

Abstract: This paper proposes an upper-limb exoskeleton teleoperation system based on magnetorheological (MR) clutches, aiming to improve operational accuracy and enhance the immersive experience during lunar sampling tasks. Conventional exoskeleton teleoperation systems commonly employ active force feedback solutions, such as servo motors, which typically suffer from high system complexity and increased en… ▽ More This paper proposes an upper-limb exoskeleton teleoperation system based on magnetorheological (MR) clutches, aiming to improve operational accuracy and enhance the immersive experience during lunar sampling tasks. Conventional exoskeleton teleoperation systems commonly employ active force feedback solutions, such as servo motors, which typically suffer from high system complexity and increased energy consumption. Furthermore, force feedback devices utilizing motors and gear reducers generally compromise backdrivability and pose safety risks to operators due to active force output. To address these limitations, we propose a semi-active force feedback strategy based on MR clutches. Dynamic magnetic field control enables precise adjustment of joint stiffness and damping, thereby providing smooth and high-resolution force feedback. The designed MR clutch exhibits outstanding performance across key metrics, achieving a torque-to-mass ratio (TMR) of 93.6 Nm/kg, a torque-to-volume ratio (TVR) of 4.05 x 10^5 Nm/m^3, and a torque-to-power ratio (TPR) of 4.15 Nm/W. Notably, the TMR represents an improvement of approximately 246% over a representative design in prior work. Experimental results validate the system's capability to deliver high-fidelity force feedback. Overall, the proposed system presents a promising solution for deep-space teleoperation with strong potential for real-world deployment in future missions. △ Less

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.15082 [pdf, ps, other]

Make Your AUV Adaptive: An Environment-Aware Reinforcement Learning Framework For Underwater Tasks

Authors: Yimian Ding, Jingzehua Xu, Guanwen Xie, Shuai Zhang, Yi Li

Abstract: This study presents a novel environment-aware reinforcement learning (RL) framework designed to augment the operational capabilities of autonomous underwater vehicles (AUVs) in underwater environments. Departing from traditional RL architectures, the proposed framework integrates an environment-aware network module that dynamically captures flow field data, effectively embedding this critical envi… ▽ More This study presents a novel environment-aware reinforcement learning (RL) framework designed to augment the operational capabilities of autonomous underwater vehicles (AUVs) in underwater environments. Departing from traditional RL architectures, the proposed framework integrates an environment-aware network module that dynamically captures flow field data, effectively embedding this critical environmental information into the state space. This integration facilitates real-time environmental adaptation, significantly enhancing the AUV's situational awareness and decision-making capabilities. Furthermore, the framework incorporates AUV structure characteristics into the optimization process, employing a large language model (LLM)-based iterative refinement mechanism that leverages both environmental conditions and training outcomes to optimize task performance. Comprehensive experimental evaluations demonstrate the framework's superior performance, robustness and adaptability. △ Less

Submitted 17 June, 2025; originally announced June 2025.

Comments: This paper has been accepted by IROS 2025

arXiv:2506.13642 [pdf, ps, other]

Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Authors: Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng

Abstract: The emergence of GPT-4o-like large multimodal models (LMMs) has raised the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representation of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for… ▽ More The emergence of GPT-4o-like large multimodal models (LMMs) has raised the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representation of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs LLM as the backbone and aligns the vision and speech to the text based on their relationships. For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimensional mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience. △ Less

Submitted 22 June, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

Comments: Code: https://github.com/ictnlp/Stream-Omni , Model: https://huggingface.co/ICTNLP/stream-omni-8b

arXiv:2506.12668 [pdf, ps, other]

SIC-Free Rate-Splitting Multiple Access: Constellation-Constrained Optimization and Application to Large-Scale Systems

Authors: Sibo Zhang, Bruno Clerckx, David Vargas

Abstract: Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique for future wireless communication systems. Recent research demonstrates that RSMA can maintain its superiority without relying on Successive Interference Cancellation (SIC) receivers. In practical systems, SIC-free receivers are more attractive than SIC receivers because of their low complexity and l… ▽ More Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique for future wireless communication systems. Recent research demonstrates that RSMA can maintain its superiority without relying on Successive Interference Cancellation (SIC) receivers. In practical systems, SIC-free receivers are more attractive than SIC receivers because of their low complexity and latency. This paper evaluates the theoretical limits of RSMA with and without SIC receivers under finite constellations. We first derive the constellation-constrained rate expressions for RSMA. We then design algorithms based on projected subgradient ascent to optimize the precoders and maximize the weighted sum-rate or max-min fairness (MMF) among users. To apply the proposed optimization algorithms to large-scale systems, one challenge lies in the exponentially increasing computational complexity brought about by the constellation-constrained rate expressions. In light of this, we propose methods to avoid such computational burden. Numerical results show that, under optimized precoders, SIC-free RSMA leads to minor losses in weighted sum-rate and MMF performance in comparison to RSMA with SIC receivers, making it a viable option for future implementations. △ Less

Submitted 14 June, 2025; originally announced June 2025.

Comments: Submitted to IEEE for publication

arXiv:2506.12646 [pdf, ps, other]

Optimal and Suboptimal Decoders under Finite-Alphabet Interference: A Mismatched Decoding Perspective

Authors: Sibo Zhang, Bruno Clerckx

Abstract: Interference widely exists in communication systems and is often not optimally treated at the receivers due to limited knowledge and/or computational burden. Evolutions of receivers have been proposed to balance complexity and spectral efficiency, for example, for 6G, while commonly used performance metrics, such as capacity and mutual information, fail to capture the suboptimal treatment of inter… ▽ More Interference widely exists in communication systems and is often not optimally treated at the receivers due to limited knowledge and/or computational burden. Evolutions of receivers have been proposed to balance complexity and spectral efficiency, for example, for 6G, while commonly used performance metrics, such as capacity and mutual information, fail to capture the suboptimal treatment of interference, leading to potentially inaccurate performance evaluations. Mismatched decoding is an information-theoretic tool for analyzing communications with suboptimal decoders. In this work, we use mismatched decoding to analyze communications with decoders that treat interference suboptimally, aiming at more accurate performance metrics. Specifically, we consider a finite-alphabet input Gaussian channel under interference, representative of modern systems, where the decoder can be matched (optimal) or mismatched (suboptimal) to the channel. The matched capacity is derived using Mutual Information (MI), while a lower bound on the mismatched capacity under various decoding metrics is derived using the Generalized Mutual Information (GMI). We show that the decoding metric in the proposed channel model is closely related to the behavior of the demodulator in Bit-Interleaved Coded Modulation (BICM) systems. Simulations illustrate that GMI/MI accurately predicts the throughput performance of BICM-type systems. Finally, we extend the channel model and the GMI to multiple antenna cases, with an example of multi-user multiple-input-single-output (MU-MISO) precoder optimization problem considering GMI under different decoding strategies. In short, this work discovers new insights about the impact of interference, proposes novel receivers, and introduces a new design and performance evaluation framework that more accurately captures the effect of interference. △ Less

Submitted 14 June, 2025; originally announced June 2025.

Comments: Submitted to IEEE for publication

arXiv:2506.12495 [pdf, ps, other]

Automated Heuristic Design for Unit Commitment Using Large Language Models

Authors: Junjin Lv, Chenggang Cui, Shaodi Zhang, Hui Chen, Chunyang Gong, Jiaming Liu

Abstract: The Unit Commitment (UC) problem is a classic challenge in the optimal scheduling of power systems. Years of research and practice have shown that formulating reasonable unit commitment plans can significantly improve the economic efficiency of power systems' operations. In recent years, with the introduction of technologies such as machine learning and the Lagrangian relaxation method, the soluti… ▽ More The Unit Commitment (UC) problem is a classic challenge in the optimal scheduling of power systems. Years of research and practice have shown that formulating reasonable unit commitment plans can significantly improve the economic efficiency of power systems' operations. In recent years, with the introduction of technologies such as machine learning and the Lagrangian relaxation method, the solution methods for the UC problem have become increasingly diversified, but still face challenges in terms of accuracy and robustness. This paper proposes a Function Space Search (FunSearch) method based on large language models. This method combines pre-trained large language models and evaluators to creatively generate solutions through the program search and evolution process while ensuring their rationality. In simulation experiments, a case of unit commitment with $10$ units is used mainly. Compared to the genetic algorithm, the results show that FunSearch performs better in terms of sampling time, evaluation time, and total operating cost of the system, demonstrating its great potential as an effective tool for solving the UC problem. △ Less

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.10653 [pdf, ps, other]

Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

Authors: Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya

Abstract: Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation… ▽ More Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis or "pseudo-label", this paper proposes a novel loss function, the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a "speaker code" characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in word error rate on one minute of adaptation data, increasing on 10 minutes to 29%. △ Less

Submitted 12 June, 2025; originally announced June 2025.

Journal ref: Interspeech 2025

arXiv:2506.09377 [pdf, ps, other]

An Interpretable Two-Stage Feature Decomposition Method for Deep Learning-based SAR ATR

Authors: Chenwei Wang, Renjie Xu, Congwen Wu, Cunyi Yin, Ziyun Liao, Deqing Mao, Sitong Zhang, Hong Yan

Abstract: Synthetic aperture radar automatic target recognition (SAR ATR) has seen significant performance improvements with deep learning. However, the black-box nature of deep SAR ATR introduces low confidence and high risks in decision-critical SAR applications, hindering practical deployment. To address this issue, deep SAR ATR should provide an interpretable reasoning basis $r_b$ and logic $λ_w$, formi… ▽ More Synthetic aperture radar automatic target recognition (SAR ATR) has seen significant performance improvements with deep learning. However, the black-box nature of deep SAR ATR introduces low confidence and high risks in decision-critical SAR applications, hindering practical deployment. To address this issue, deep SAR ATR should provide an interpretable reasoning basis $r_b$ and logic $λ_w$, forming the reasoning logic $\sum_{i} {{r_b^i} \times {λ_w^i}} =pred$ behind the decisions. Therefore, this paper proposes a physics-based two-stage feature decomposition method for interpretable deep SAR ATR, which transforms uninterpretable deep features into attribute scattering center components (ASCC) with clear physical meanings. First, ASCCs are obtained through a clustering algorithm. To extract independent physical components from deep features, we propose a two-stage decomposition method. In the first stage, a feature decoupling and discrimination module separates deep features into approximate ASCCs with global discriminability. In the second stage, a multilayer orthogonal non-negative matrix tri-factorization (MLO-NMTF) further decomposes the ASCCs into independent components with distinct physical meanings. The MLO-NMTF elegantly aligns with the clustering algorithms to obtain ASCCs. Finally, this method ensures both an interpretable reasoning process and accurate recognition results. Extensive experiments on four benchmark datasets confirm its effectiveness, showcasing the method's interpretability, robust recognition performance, and strong generalization capability. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.08366 [pdf, ps, other]

Learning event-triggered controllers for linear parameter-varying systems from data

Authors: Renjie Ma, Su Zhang, Wenjie Liu, Zhijian Hu, Peng Shi

Abstract: Nonlinear dynamical behaviours in engineering applications can be approximated by linear-parameter varying (LPV) representations, but obtaining precise model knowledge to develop a control algorithm is difficult in practice. In this paper, we develop the data-driven control strategies for event-triggered LPV systems with stability verifications. First, we provide the theoretical analysis of $θ$-pe… ▽ More Nonlinear dynamical behaviours in engineering applications can be approximated by linear-parameter varying (LPV) representations, but obtaining precise model knowledge to develop a control algorithm is difficult in practice. In this paper, we develop the data-driven control strategies for event-triggered LPV systems with stability verifications. First, we provide the theoretical analysis of $θ$-persistence of excitation for LPV systems, which leads to the feasible data-based representations. Then, in terms of the available perturbed data, we derive the stability certificates for event-triggered LPV systems with the aid of Petersen's lemma in the sense of robust control, resulting in the computationally tractable semidefinite programmings, the feasible solutions of which yields the optimal gain schedulings. Besides, we generalize the data-driven eventtriggered LPV control methods to the scenario of reference trajectory tracking, and discuss the robust tracking stability accordingly. Finally, we verify the effectiveness of our theoretical derivations by numerical simulations. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: 13 pages, 5 figures

arXiv:2506.07869 [pdf, ps, other]

Hybrid Beamforming Optimization for MIMO ISAC Exploiting Prior Information: A PCRB-based Approach

Authors: Yizhuo Wang, Shuowen Zhang

Abstract: This paper considers a multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system, where a multi-antenna base station (BS) with transceiver hybrid analog-digital arrays transmits dual-functional signals to communicate with a multi-antenna user and simultaneously sense the unknown and random location information of a target based on the reflected echo signals and the p… ▽ More This paper considers a multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system, where a multi-antenna base station (BS) with transceiver hybrid analog-digital arrays transmits dual-functional signals to communicate with a multi-antenna user and simultaneously sense the unknown and random location information of a target based on the reflected echo signals and the prior distribution information on the target's location. Under transceiver hybrid arrays, we characterize the sensing performance by deriving the posterior Cramér-Rao bound (PCRB) of the mean-squared error which is a function of the transmit hybrid beamforming and receive analog beamforming. We study joint transmit hybrid beamforming and receive analog beamforming optimization to minimize the PCRB subject to a communication rate requirement. We first consider a sensing-only system and derive the optimal solution to each element in the transmit/receive analog beamforming matrices that minimizes the PCRB in closed form. Then, we develop an alternating optimization (AO) based algorithm. Next, we study a narrowband MIMO ISAC system and devise an efficient AO-based hybrid beamforming algorithm by leveraging weighted minimum mean-squared error and feasible point pursuit successive convex approximation methods. Furthermore, we extend the results for narrowband systems to a MIMO orthogonal frequency-division multiplexing (OFDM) ISAC system. Numerical results validate the effectiveness of our proposed hybrid beamforming designs. It is revealed that the number of receive RF chains has more significant impact on the sensing performance than its transmit counterpart. Under a given budget on the total number of transmit/receive RF chains at the BS, the optimal number of transmit RF chains increases as the communication rate target increases due to the non-trivial PCRB-rate trade-off. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: submitted for possible journal publication

arXiv:2506.06012 [pdf, ps, other]

Enhanced Trust Region Sequential Convex Optimization for Multi-Drone Thermal Screening Trajectory Planning in Urban Environments

Authors: Kaiyuan Chen, Zhengjie Hu, Shaolin Zhang, Yuanqing Xia, Wannian Liang, Shuo Wang

Abstract: The rapid detection of abnormal body temperatures in urban populations is essential for managing public health risks, especially during outbreaks of infectious diseases. Multi-drone thermal screening systems offer promising solutions for fast, large-scale, and non-intrusive human temperature monitoring. However, trajectory planning for multiple drones in complex urban environments poses significan… ▽ More The rapid detection of abnormal body temperatures in urban populations is essential for managing public health risks, especially during outbreaks of infectious diseases. Multi-drone thermal screening systems offer promising solutions for fast, large-scale, and non-intrusive human temperature monitoring. However, trajectory planning for multiple drones in complex urban environments poses significant challenges, including collision avoidance, coverage efficiency, and constrained flight environments. In this study, we propose an enhanced trust region sequential convex optimization (TR-SCO) algorithm for optimal trajectory planning of multiple drones performing thermal screening tasks. Our improved algorithm integrates a refined convex optimization formulation within a trust region framework, effectively balancing trajectory smoothness, obstacle avoidance, altitude constraints, and maximum screening coverage. Simulation results demonstrate that our approach significantly improves trajectory optimality and computational efficiency compared to conventional convex optimization methods. This research provides critical insights and practical contributions toward deploying efficient multi-drone systems for real-time thermal screening in urban areas. For reader who are interested in our research, we release our source code at https://github.com/Cherry0302/Enhanced-TR-SCO. △ Less

Submitted 19 June, 2025; v1 submitted 6 June, 2025; originally announced June 2025.

arXiv:2506.03134 [pdf, ps, other]

Simulate Any Radar: Attribute-Controllable Radar Simulation via Waveform Parameter Embedding

Authors: Weiqing Xiao, Hao Huang, Chonghao Zhong, Yujie Lin, Nan Wang, Xiaoxue Chen, Zhaoxi Chen, Saining Zhang, Shuocheng Yang, Pierre Merriaux, Lei Lei, Hao Zhao

Abstract: We present SA-Radar (Simulate Any Radar), a radar simulation approach that enables controllable and efficient generation of radar cubes conditioned on customizable radar attributes. Unlike prior generative or physics-based simulators, SA-Radar integrates both paradigms through a waveform-parameterized attribute embedding. We design ICFAR-Net, a 3D U-Net conditioned on radar attributes encoded via… ▽ More We present SA-Radar (Simulate Any Radar), a radar simulation approach that enables controllable and efficient generation of radar cubes conditioned on customizable radar attributes. Unlike prior generative or physics-based simulators, SA-Radar integrates both paradigms through a waveform-parameterized attribute embedding. We design ICFAR-Net, a 3D U-Net conditioned on radar attributes encoded via waveform parameters, which captures signal variations induced by different radar configurations. This formulation bypasses the need for detailed radar hardware specifications and allows efficient simulation of range-azimuth-Doppler (RAD) tensors across diverse sensor settings. We further construct a mixed real-simulated dataset with attribute annotations to robustly train the network. Extensive evaluations on multiple downstream tasks-including 2D/3D object detection and radar semantic segmentation-demonstrate that SA-Radar's simulated data is both realistic and effective, consistently improving model performance when used standalone or in combination with real data. Our framework also supports simulation in novel sensor viewpoints and edited scenes, showcasing its potential as a general-purpose radar data engine for autonomous driving applications. Code and additional materials are available at https://zhuxing0.github.io/projects/SA-Radar. △ Less

Submitted 3 June, 2025; originally announced June 2025.

Comments: Code: https://github.com/zhuxing0/SA-Radar Project page: https://zhuxing0.github.io/projects/SA-Radar

arXiv:2506.02604 [pdf]

Application of convolutional neural networks in image super-resolution

Authors: Chunwei Tian, Mingjian Song, Wangmeng Zuo, Bo Du, Yanning Zhang, Shichao Zhang

Abstract: Due to strong learning abilities of convolutional neural networks (CNNs), they have become mainstream methods for image super-resolution. However, there are big differences of different deep learning methods with different types. There is little literature to summarize relations and differences of different methods in image super-resolution. Thus, summarizing these literatures are important, accor… ▽ More Due to strong learning abilities of convolutional neural networks (CNNs), they have become mainstream methods for image super-resolution. However, there are big differences of different deep learning methods with different types. There is little literature to summarize relations and differences of different methods in image super-resolution. Thus, summarizing these literatures are important, according to loading capacity and execution speed of devices. This paper first introduces principles of CNNs in image super-resolution, then introduces CNNs based bicubic interpolation, nearest neighbor interpolation, bilinear interpolation, transposed convolution, sub-pixel layer, meta up-sampling for image super-resolution to analyze differences and relations of different CNNs based interpolations and modules, and compare performance of these methods by experiments. Finally, this paper gives potential research points and drawbacks and summarizes the whole paper, which can facilitate developments of CNNs in image super-resolution. △ Less

Submitted 6 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

Comments: It has been accepted by CAAI transactions on intelligent systems, in Chinese language

arXiv:2506.01947 [pdf, ps, other]

RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report

Authors: Marcos V. Conde, Radu Timofte, Radu Berdan, Beril Besbinar, Daisuke Iso, Pengzhou Ji, Xiong Dun, Zeying Fan, Chen Wu, Zhansheng Wang, Pengbo Zhang, Jiazi Huang, Qinglin Liu, Wei Yu, Shengping Zhang, Xiangyang Ji, Kyungsik Kim, Minkyung Kim, Hwalmin Lee, Hekun Ma, Huan Zheng, Yanyan Wei, Zhao Zhang, Jing Fang, Meilin Gao , et al. (8 additional authors not shown)

Abstract: Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW… ▽ More Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW Reconstruction from sRGB (Reverse ISP). We aim to recover RAW sensor images from smartphones given the corresponding sRGB images without metadata and, by doing this, ``reverse" the ISP transformation. Over 150 participants joined this NTIRE 2025 challenge and submitted efficient models. The proposed methods and benchmark establish the state-of-the-art for generating realistic RAW data. △ Less

Submitted 2 June, 2025; originally announced June 2025.

Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

arXiv:2505.23036 [pdf, ps, other]

AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition

Authors: Yuhang Dai, He Wang, Xingchen Li, Zihan Zhang, Shuiyuan Wang, Lei Xie, Xin Xu, Hongxiao Guo, Shaoji Zhang, Hui Bu, Wei Chen

Abstract: This paper delineates AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHLL-5 includes two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios. This audio data consists of four far-field speech signals captured by microphones located on each car do… ▽ More This paper delineates AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHLL-5 includes two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios. This audio data consists of four far-field speech signals captured by microphones located on each car door, as well as near-field signals obtained from high-fidelity headset microphones worn by each speaker. (2) a collection of 40 hours of real-world environmental noise recordings, which supports the in-car speech data simulation. Moreover, we also provide an open-access, reproducible baseline system based on this dataset. This system features a speech frontend model that employs speech source separation to extract each speaker's clean speech from the far-field signals, along with a speech recognition module that accurately transcribes the content of each individual speaker. Experimental results demonstrate the challenges faced by various mainstream ASR models when evaluated on the AISHELL-5. We firmly believe the AISHELL-5 dataset will significantly advance the research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark. △ Less

Submitted 28 May, 2025; originally announced May 2025.

Comments: 5 pages, 1 figures, 3 tables, accepted by InterSpeech 2025

arXiv:2505.22251 [pdf, ps, other]

Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition

Authors: Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

Abstract: Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of fin… ▽ More Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure contamination impact, LLMs trained with/without contamination are compared. A contaminated LLM is more likely to generate test sentences it has seen during training. Then, speech recognisers based on LLMs are compared. They show only subtle error rate differences if the LLM is contaminated, but assign significantly higher probabilities to transcriptions seen during LLM training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data. △ Less

Submitted 5 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.21578 [pdf, ps, other]

Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use

Authors: Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier van Dalen

Abstract: Automatic speech recognition (ASR) research is driven by the availability of common datasets between industrial researchers and academics, encouraging comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, Gigaspeech, OWSM… ▽ More Automatic speech recognition (ASR) research is driven by the availability of common datasets between industrial researchers and academics, encouraging comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, Gigaspeech, OWSM, Libriheavy or People's Speech suffer from major limitations including licenses that researchers in the industry cannot use, unreliable transcriptions, incorrect audio data, or the lack of evaluation sets. This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. Featuring hundreds of thousands of speakers with diverse accents and a wide range of speech types (read, spontaneous, talks, clean, noisy), the Loquacious Set is designed to work for academics and researchers in the industry to build ASR systems in real-world scenarios. △ Less

Submitted 27 May, 2025; originally announced May 2025.

Comments: Accepted at Interspeech 2025

arXiv:2505.20956 [pdf, ps, other]

Hybrid Disagreement-Diversity Active Learning for Bioacoustic Sound Event Detection

Authors: Shiqi Zhang, Tuomas Virtanen

Abstract: Bioacoustic sound event detection (BioSED) is crucial for biodiversity conservation but faces practical challenges during model development and training: limited amounts of annotated data, sparse events, species diversity, and class imbalance. To address these challenges efficiently with a limited labeling budget, we apply the mismatch-first farthest-traversal (MFFT), an active learning method int… ▽ More Bioacoustic sound event detection (BioSED) is crucial for biodiversity conservation but faces practical challenges during model development and training: limited amounts of annotated data, sparse events, species diversity, and class imbalance. To address these challenges efficiently with a limited labeling budget, we apply the mismatch-first farthest-traversal (MFFT), an active learning method integrating committee voting disagreement and diversity analysis. We also refine an existing BioSED dataset specifically for evaluating active learning algorithms. Experimental results demonstrate that MFFT achieves a mAP of 68% when cold-starting and 71% when warm-starting (which is close to the fully-supervised mAP of 75%) while using only 2.3% of the annotations. Notably, MFFT excels in cold-start scenarios and with rare species, which are critical for monitoring endangered species, demonstrating its practical value. △ Less

Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

Comments: 5 pages, 1 figure, accepted by EUSIPCO 2025 v2: add our github repo

arXiv:2505.17589 [pdf, ps, other]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Authors: Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye

Abstract: In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-… ▽ More In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3. △ Less

Submitted 27 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

Comments: Preprint, work in progress

arXiv:2505.16236 [pdf, ps, other]

Base Station Placement Optimization for Networked Sensing Exploiting Target Location Distribution

Authors: Kaiyue Hou, Shuowen Zhang

Abstract: This paper studies a networked sensing system with multiple base stations (BSs), which collaboratively sense the unknown and random three-dimensional (3D) location of a target based on the target-reflected echo signals received at the BSs. Considering a practical scenario where the target location distribution is known a priori for exploitation, we aim to design the placement of the multiple BSs t… ▽ More This paper studies a networked sensing system with multiple base stations (BSs), which collaboratively sense the unknown and random three-dimensional (3D) location of a target based on the target-reflected echo signals received at the BSs. Considering a practical scenario where the target location distribution is known a priori for exploitation, we aim to design the placement of the multiple BSs to optimize the networked sensing performance. Firstly, we characterize the posterior Cramér-Rao bound (PCRB) of the mean-squared error (MSE) in sensing the target's 3D location. Despite its complex form under networked sensing, we derive its closed-form expression in terms of the BS locations. Next, we formulate the BS placement optimization problem to minimize the sensing PCRB, which is non-convex and difficult to solve. By leveraging a series of equivalent transformations and the iterative inner approximation method, we devise an algorithm with polynomial-time complexity which is guaranteed to converge to a solution satisfying the Karush-Kuhn Tucker (KKT) conditions of the problem. Numerical results show that the proposed placement design significantly outperforms various benchmark designs. △ Less

Submitted 22 May, 2025; originally announced May 2025.

Comments: Longer version of a paper submitted for possible publication

arXiv:2505.16230 [pdf, ps, other]

Beyond Diagonal Intelligent Reflecting Surface Aided Integrated Sensing and Communication

Authors: Shuo Zheng, Shuowen Zhang

Abstract: Beyond diagonal intelligent reflecting surface (BD-IRS) is a new promising IRS architecture for which the reflection matrix is not limited to the diagonal structure as for conventional IRS. In this paper, we study a BD-IRS aided uplink integrated sensing and communication (ISAC) system where sensing is performed in a device-based manner. Specifically, we aim to estimate the unknown and random loca… ▽ More Beyond diagonal intelligent reflecting surface (BD-IRS) is a new promising IRS architecture for which the reflection matrix is not limited to the diagonal structure as for conventional IRS. In this paper, we study a BD-IRS aided uplink integrated sensing and communication (ISAC) system where sensing is performed in a device-based manner. Specifically, we aim to estimate the unknown and random location of an active target based on its uplink probing signals sent to a multi-antenna base station (BS) as well as the known prior distribution information of the target's location. Multiple communication users also simultaneously send uplink signals, resulting in a challenging mutual interference issue between sensing and communication. We first characterize the sensing performance metric by deriving the posterior Cramér-Rao bound (PCRB) of the mean-squared error (MSE) when prior information is available. Then, we formulate a BD-IRS reflection matrix optimization problem to maximize the minimum expected achievable rate among the multiple users subject to a constraint on the PCRB as well as the lossless and reciprocal constraints on the BD-IRS reflection matrix. The formulated problem is non-convex and challenging to solve. To tackle this problem, we propose a penalty dual decomposition (PDD) based algorithm which can find a high-quality suboptimal solution with polynomial-time complexity. In addition, we propose and optimize a time-division multiple access (TDMA) based scheme which removes the sensing-communication mutual interference. Numerical results verify the effectiveness of the proposed designs and provide useful design insights such as the optimal choice of multiple access scheme. △ Less

Submitted 24 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: Accepted to appear in IEEE Transactions on Cognitive Communications and Networking, special issue on smart environment engineering for integrated sensing and communication

arXiv:2505.16211 [pdf, ps, other]

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu , et al. (6 additional authors not shown)

Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safet… ▽ More The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust-the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust. △ Less

Submitted 1 July, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: Technical Report

arXiv:2505.15279 [pdf, ps, other]

Robust Secure Communications in Near-Field ISCAP Systems with Extremely Large-Scale Antenna Array

Authors: Zixiang Ren, Siyao Zhang, Ling Qiu, Derrick Wing Kwan Ng, Jie Xu

Abstract: This paper investigates robust secure communications in a near-field integrated sensing, communication, and powering (ISCAP) system, in which the base station (BS) is equipped with an extremely large-scale antenna array (ELAA). In this system, the BS transmits confidential messages to a single legitimate communication user (CU), simultaneously providing wireless power transfer to multiple energy r… ▽ More This paper investigates robust secure communications in a near-field integrated sensing, communication, and powering (ISCAP) system, in which the base station (BS) is equipped with an extremely large-scale antenna array (ELAA). In this system, the BS transmits confidential messages to a single legitimate communication user (CU), simultaneously providing wireless power transfer to multiple energy receivers (ERs) and performing point target sensing. We consider a scenario in which both the ERs and the sensing target may act as potential eavesdroppers attempting to intercept the confidential messages. To safeguard secure communication, the BS employs a joint beamforming design by transmitting information beams combined with dedicated triple-purpose beams serving as energy and sensing signals, as well as artificial noise (AN) for effectively jamming potential eavesdroppers. It is assumed that only coarse location information of the ERs and sensing targets or eavesdroppers is available at the BS, leading to imperfect channel state information (CSI). Under this setup, we formulate a robust beamforming optimization problem with the objective of maximizing the secrecy rate for the CU, while ensuring worst-case performance requirements on both target sensing and wireless energy harvesting at the ERs. To address the non-convex robust joint beamforming problem and facilitate the deployment of a low-complexity algorithm, we employ the S-procedure alongside an eavesdropping CSI error-bound determination method to acquire a high-quality solution. △ Less

Submitted 21 May, 2025; originally announced May 2025.

Comments: 13 pages

arXiv:2505.13762

From Structural Design to Dynamics Modeling: Control-Oriented Development of a 3-RRR Parallel Ankle Rehabilitation Robot

Authors: Siyuan Zhang, Yufei Zhang, Junlin Lyu, Sunil K. Agrawal

Abstract: This paper presents the development of a wearable ankle rehabilitation robot based on a 3-RRR spherical parallel mechanism (SPM) to support multi-DOF recovery through pitch, roll, and yaw motions. The system features a compact, ergonomic structure designed for comfort, safety, and compatibility with ankle biomechanics. A complete design-to-dynamics pipeline has been implemented, including structur… ▽ More This paper presents the development of a wearable ankle rehabilitation robot based on a 3-RRR spherical parallel mechanism (SPM) to support multi-DOF recovery through pitch, roll, and yaw motions. The system features a compact, ergonomic structure designed for comfort, safety, and compatibility with ankle biomechanics. A complete design-to-dynamics pipeline has been implemented, including structural design, kinematic modeling for motion planning, and Lagrangian-based dynamic modeling for torque estimation and simulation analysis. Preliminary simulations verify stable joint coordination and smooth motion tracking under representative rehabilitation trajectories. The control framework is currently being developed to enhance responsiveness across the workspace. Future work will focus on integrating personalized modeling and adaptive strategies to address kinematic singularities through model based control. This work establishes a foundational platform for intelligent, personalized ankle rehabilitation, enabling both static training and potential extension to gait-phase-timed assistance. △ Less

Submitted 30 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

Comments: This paper was originally submitted as a class project and included the name of a faculty member without prior permission. At the instructor's request, I am withdrawing the paper. The work may be resubmitted in the future after further development and testing

arXiv:2505.12887

RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions

Authors: Junzhi Ning, Cheng Tang, Kaijin Zhou, Diping Song, Lihao Liu, Ming Hu, Wei Li, Yanzhou Su, Tianbing Li, Jiyao Liu, Yejin, Sheng Zhang, Yuanfeng Ji, Junjun He

Abstract: The scarcity of high-quality, labelled retinal imaging data, which presents a significant challenge in the development of machine learning models for ophthalmology, hinders progress in the field. To synthesise Colour Fundus Photographs (CFPs), existing methods primarily relying on predefined disease labels face significant limitations. However, current methods remain limited, thus failing to gener… ▽ More The scarcity of high-quality, labelled retinal imaging data, which presents a significant challenge in the development of machine learning models for ophthalmology, hinders progress in the field. To synthesise Colour Fundus Photographs (CFPs), existing methods primarily relying on predefined disease labels face significant limitations. However, current methods remain limited, thus failing to generate images for broader categories with diverse and fine-grained anatomical structures. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale, synthetic Caption-CFP dataset comprising 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses large language models (LLMs) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Furthermore, based on this dataset, we employ a novel three-step training framework, called RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Extensive experiments demonstrate state-of-the-art performance across multiple datasets, with 62.07% of text-driven synthetic images indistinguishable from real ones by ophthalmologists. Moreover, the synthetic data improves accuracy by 10%-25% in diabetic retinopathy grading and glaucoma detection, thereby providing a scalable solution to augment ophthalmic datasets. △ Less

Submitted 24 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

Comments: The paper is being withdrawn due to issues with the authorship list. Specifically, one or more contributors were unintentionally omitted in the initial submission

arXiv:2505.11200 [pdf, ps, other]

Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese

Authors: Xihuai Wang, Ziyi Zhao, Siyu Ren, Shao Zhang, Song Li, Xiaoyu Li, Ziwen Wang, Lin Qiu, Guanglu Wan, Xuezhi Cao, Xunliang Cai, Weinan Zhang

Abstract: Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS Systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS System evaluation, it suffers from subjectivity, environmental inconsistencies, and limited… ▽ More Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS Systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS System evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset ATT-Corpus paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also finetune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4). △ Less

Submitted 16 May, 2025; originally announced May 2025.

Comments: Under Review

arXiv:2505.10834 [pdf, ps, other]

TACO: Rethinking Semantic Communications with Task Adaptation and Context Embedding

Authors: Achintha Wijesinghe, Weiwei Wang, Suchinthaka Wanninayaka, Songyang Zhang, Zhi Ding

Abstract: Recent advancements in generative artificial intelligence have introduced groundbreaking approaches to innovating next-generation semantic communication, which prioritizes conveying the meaning of a message rather than merely transmitting raw data. A fundamental challenge in semantic communication lies in accurately identifying and extracting the most critical semantic information while adapting t… ▽ More Recent advancements in generative artificial intelligence have introduced groundbreaking approaches to innovating next-generation semantic communication, which prioritizes conveying the meaning of a message rather than merely transmitting raw data. A fundamental challenge in semantic communication lies in accurately identifying and extracting the most critical semantic information while adapting to downstream tasks without degrading performance, particularly when the objective at the receiver may evolve over time. To enable flexible adaptation to multiple tasks at the receiver, this work introduces a novel semantic communication framework, which is capable of jointly capturing task-specific information to enhance downstream task performance and contextual information. Through rigorous experiments on popular image datasets and computer vision tasks, our framework shows promising improvement compared to existing work, including superior performance in downstream tasks, better generalizability, ultra-high bandwidth efficiency, and low reconstruction latency. △ Less

Submitted 16 May, 2025; originally announced May 2025.

Comments: Submitted to the IEEE GlobeCom 2025

arXiv:2505.10134 [pdf, other]

Large Wireless Localization Model (LWLM): A Foundation Model for Positioning in 6G Networks

Authors: Guangjin Pan, Kaixuan Huang, Hui Chen, Shunqing Zhang, Christian Häger, Henk Wymeersch

Abstract: Accurate and robust localization is a critical enabler for emerging 5G and 6G applications, including autonomous driving, extended reality (XR), and smart manufacturing. While data-driven approaches have shown promise, most existing models require large amounts of labeled data and struggle to generalize across deployment scenarios and wireless configurations. To address these limitations, we propo… ▽ More Accurate and robust localization is a critical enabler for emerging 5G and 6G applications, including autonomous driving, extended reality (XR), and smart manufacturing. While data-driven approaches have shown promise, most existing models require large amounts of labeled data and struggle to generalize across deployment scenarios and wireless configurations. To address these limitations, we propose a foundation-model-based solution tailored for wireless localization. We first analyze how different self-supervised learning (SSL) tasks acquire general-purpose and task-specific semantic features based on information bottleneck (IB) theory. Building on this foundation, we design a pretraining methodology for the proposed Large Wireless Localization Model (LWLM). Specifically, we propose an SSL framework that jointly optimizes three complementary objectives: (i) spatial-frequency masked channel modeling (SF-MCM), (ii) domain-transformation invariance (DTI), and (iii) position-invariant contrastive learning (PICL). These objectives jointly capture the underlying semantics of wireless channel from multiple perspectives. We further design lightweight decoders for key downstream tasks, including time-of-arrival (ToA) estimation, angle-of-arrival (AoA) estimation, single base station (BS) localization, and multiple BS localization. Comprehensive experimental results confirm that LWLM consistently surpasses both model-based and supervised learning baselines across all localization tasks. In particular, LWLM achieves 26.0%--87.5% improvement over transformer models without pretraining, and exhibits strong generalization under label-limited fine-tuning and unseen BS configurations, confirming its potential as a foundation model for wireless localization. △ Less

Submitted 15 May, 2025; originally announced May 2025.

Comments: 13 pages,16 figures.This work has been submitted to the IEEE for possible publication

arXiv:2505.04068 [pdf, other]

Shadow Wireless Intelligence: Large Language Model-Driven Reasoning in Covert Communications

Authors: Yuanai Xie, Zhaozhi Liu, Xiao Zhang, Shihua Zhang, Rui Hou, Minrui Xu, Ruichen Zhang, Dusit Niyato

Abstract: Covert Communications (CC) can secure sensitive transmissions in industrial, military, and mission-critical applications within 6G wireless networks. However, traditional optimization methods based on Artificial Noise (AN), power control, and channel manipulation might not adapt to dynamic and adversarial environments due to the high dimensionality, nonlinearity, and stringent real-time covertness… ▽ More Covert Communications (CC) can secure sensitive transmissions in industrial, military, and mission-critical applications within 6G wireless networks. However, traditional optimization methods based on Artificial Noise (AN), power control, and channel manipulation might not adapt to dynamic and adversarial environments due to the high dimensionality, nonlinearity, and stringent real-time covertness requirements. To bridge this gap, we introduce Shadow Wireless Intelligence (SWI), which integrates the reasoning capabilities of Large Language Models (LLMs) with retrieval-augmented generation to enable intelligent decision-making in covert wireless systems. Specifically, we utilize DeepSeek-R1, a mixture-of-experts-based LLM with RL-enhanced reasoning, combined with real-time retrieval of domain-specific knowledge to improve context accuracy and mitigate hallucinations. Our approach develops a structured CC knowledge base, supports context-aware retrieval, and performs semantic optimization, allowing LLMs to generate and adapt CC strategies in real time. In a case study on optimizing AN power in a full-duplex CC scenario, DeepSeek-R1 achieves 85% symbolic derivation accuracy and 94% correctness in the generation of simulation code, outperforming baseline models. These results validate SWI as a robust, interpretable, and adaptive foundation for LLM-driven intelligent covert wireless systems in 6G networks. △ Less

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.02625 [pdf, ps, other]

LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng

Abstract: Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achievin… ▽ More Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data. △ Less

Submitted 5 May, 2025; originally announced May 2025.

Comments: Preprint. Project: https://github.com/ictnlp/LLaMA-Omni2

arXiv:2505.00578 [pdf, other]

AI-Driven Segmentation and Analysis of Microbial Cells

Authors: Shuang Zhang, Carleton Coffin, Karyn L. Rogers, Catherine Ann Royer, Ge Wang

Abstract: Studying the growth and metabolism of microbes provides critical insights into their evolutionary adaptations to harsh environments, which are essential for microbial research and biotechnology applications. In this study, we developed an AI-driven image analysis system to efficiently segment individual cells and quantitatively analyze key cellular features. This system is comprised of four main m… ▽ More Studying the growth and metabolism of microbes provides critical insights into their evolutionary adaptations to harsh environments, which are essential for microbial research and biotechnology applications. In this study, we developed an AI-driven image analysis system to efficiently segment individual cells and quantitatively analyze key cellular features. This system is comprised of four main modules. First, a denoising algorithm enhances contrast and suppresses noise while preserving fine cellular details. Second, the Segment Anything Model (SAM) enables accurate, zero-shot segmentation of cells without additional training. Third, post-processing is applied to refine segmentation results by removing over-segmented masks. Finally, quantitative analysis algorithms extract essential cellular features, including average intensity, length, width, and volume. The results show that denoising and post-processing significantly improved the segmentation accuracy of SAM in this new domain. Without human annotations, the AI-driven pipeline automatically and efficiently outlines cellular boundaries, indexes them, and calculates key cellular parameters with high accuracy. This framework will enable efficient and automated quantitative analysis of high-resolution fluorescence microscopy images to advance research into microbial adaptations to grow and metabolism that allow extremophiles to thrive in their harsh habitats. △ Less

Submitted 5 May, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.16037 [pdf, other]

Adaptive Fault-tolerant Control of Underwater Vehicles with Thruster Failures

Authors: Haolin Liu, Shiliang Zhang, Shangbin Jiao, Xiaohui Zhang, Xuehui Ma, Yan Yan, Wenchuan Cui, Youmin Zhang

Abstract: This paper presents a fault-tolerant control for the trajectory tracking of autonomous underwater vehicles (AUVs) against thruster failures. We formulate faults in AUV thrusters as discrete switching events during a UAV mission, and develop a soft-switching approach in facilitating shift of control strategies across fault scenarios. We mathematically define AUV thruster fault scenarios, and develo… ▽ More This paper presents a fault-tolerant control for the trajectory tracking of autonomous underwater vehicles (AUVs) against thruster failures. We formulate faults in AUV thrusters as discrete switching events during a UAV mission, and develop a soft-switching approach in facilitating shift of control strategies across fault scenarios. We mathematically define AUV thruster fault scenarios, and develop the fault-tolerant control that captures the fault scenario via Bayesian approach. Particularly, when the AUV fault type switches from one to another, the developed control captures the fault states and maintains the control by a linear quadratic tracking controller. With the captured fault states by Bayesian approach, we derive the control law by aggregating the control outputs for individual fault scenarios weighted by their Bayesian posterior probability. The developed fault-tolerant control works in an adaptive way and guarantees soft-switching across fault scenarios, and requires no complicated fault detection dedicated to different type of faults. The entailed soft-switching ensures stable AUV trajectory tracking when fault type shifts, which otherwise leads to reduced control under hard-switching control strategies. We conduct numerical simulations with diverse AUV thruster fault settings. The results demonstrate that the proposed control can provide smooth transition across thruster failures, and effectively sustain AUV trajectory tracking control in case of thruster failures and failure shifts. △ Less

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2504.15520 [pdf, other]

Element-Grouping Strategy for Intelligent Reflecting Surface: Performance Analysis and Algorithm Optimization

Authors: Shengsheng Zhang, Taotao Ji, Meng Hua, Yongming Huang, Luxi Yang

Abstract: As a revolutionary paradigm for intelligently controlling wireless channels, intelligent reflecting surface (IRS) has emerged as a promising technology for future sixth-generation (6G) wireless communications. While IRS-aided communication systems can achieve attractive high channel gains, existing schemes require plenty of IRS elements to mitigate the ``multiplicative fading'' effect in cascaded… ▽ More As a revolutionary paradigm for intelligently controlling wireless channels, intelligent reflecting surface (IRS) has emerged as a promising technology for future sixth-generation (6G) wireless communications. While IRS-aided communication systems can achieve attractive high channel gains, existing schemes require plenty of IRS elements to mitigate the ``multiplicative fading'' effect in cascaded channels, leading to high complexity for real-time beamforming and high signaling overhead for channel estimation. In this paper, the concept of sustainable intelligent element-grouping IRS (IEG-IRS) is proposed to overcome those fundamental bottlenecks. Specifically, based on the statistical channel state information (S-CSI), the proposed grouping strategy intelligently pre-divide the IEG-IRS elements into multiple groups based on the beam-domain grouping method, with each group sharing the common reflection coefficient and being optimized in real time using the instantaneous channel state information (I-CSI). Then, we further analyze the asymptotic performance of the IEG-IRS to reveal the substantial capacity gain in an extremely large-scale IRS (XL-IRS) aided single-user single-input single-output (SU-SISO) system. In particular, when a line-of-sight (LoS) component exists, it demonstrates that the combined cascaded link can be considered as a ``deterministic virtual LoS'' channel, resulting in a sustainable squared array gain achieved by the IEG-IRS. Finally, we formulate a weighted-sum-rate (WSR) maximization problem for an IEG-IRS-aided multiuser multiple-input single-output (MU-MISO) system and a two-stage algorithm for optimizing the beam-domain grouping strategy and the multi-user active-passive beamforming is proposed. △ Less

Submitted 21 April, 2025; originally announced April 2025.

arXiv:2504.14906 [pdf, ps, other]

OmniAudio: Generating Spatial Audio from 360-Degree Video

Authors: Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue

Abstract: Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard for… ▽ More Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio features a dual-branch framework that utilizes both panoramic and perspective video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets are available at https://github.com/liuhuadai/OmniAudio. The project website is available at https://OmniAudio-360V2SA.github.io. △ Less

Submitted 2 June, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

Comments: ICML 2025

arXiv:2504.14894 [pdf, other]

Never too Cocky to Cooperate: An FIM and RL-based USV-AUV Collaborative System for Underwater Tasks in Extreme Sea Conditions

Authors: Jingzehua Xu, Guanwen Xie, Jiwei Tang, Yimian Ding, Weiyi Liu, Shuai Zhang, Yi Li

Abstract: This paper develops a novel unmanned surface vehicle (USV)-autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high-precision multi-AUV localization enabled by Fisher information matrix-optimized USV path planning, and (2) reinforcement learning-based cooperative planning and… ▽ More This paper develops a novel unmanned surface vehicle (USV)-autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high-precision multi-AUV localization enabled by Fisher information matrix-optimized USV path planning, and (2) reinforcement learning-based cooperative planning and control method for multi-AUV task execution. Extensive experimental evaluations in the underwater data collection task demonstrate the system's operational feasibility, with quantitative results showing significant performance improvements over baseline methods. The proposed system exhibits robust coordination capabilities between USV and AUVs while maintaining stability in extreme sea conditions. To facilitate reproducibility and community advancement, we provide an open-source simulation toolkit available at: https://github.com/360ZMEM/USV-AUV-colab . △ Less

Submitted 21 April, 2025; originally announced April 2025.

arXiv:2504.12867 [pdf, other]

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Authors: Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen

Abstract: Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLM… ▽ More Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Demo samples are available at https://yanghaha0908.github.io/EmoVoice/. Dataset, code, and checkpoints will be released. △ Less

Submitted 21 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.12711 [pdf, other]

NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou , et al. (112 additional authors not shown)

Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ… ▽ More This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/. △ Less

Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

arXiv:2504.11162 [pdf, ps, other]

Scalable Transceiver Design for Multi-User Communication in FDD Massive MIMO Systems via Deep Learning

Authors: Lin Zhu, Weifeng Zhu, Shuowen Zhang, Shuguang Cui, Liang Liu

Abstract: This paper addresses the joint transceiver design, including pilot transmission, channel feature extraction and feedback, as well as precoding, for low-overhead downlink massive multiple-input multiple-output (MIMO) communication in frequency-division duplex (FDD) systems. Although deep learning (DL) has shown great potential in tackling this problem, existing methods often suffer from poor scalab… ▽ More This paper addresses the joint transceiver design, including pilot transmission, channel feature extraction and feedback, as well as precoding, for low-overhead downlink massive multiple-input multiple-output (MIMO) communication in frequency-division duplex (FDD) systems. Although deep learning (DL) has shown great potential in tackling this problem, existing methods often suffer from poor scalability in practical systems, as the solution obtained in the training phase merely works for a fixed feedback capacity and a fixed number of users in the deployment phase. To address this limitation, we propose a novel DL-based framework comprised of choreographed neural networks, which can utilize one training phase to generate all the transceiver solutions used in the deployment phase with varying sizes of feedback codebooks and numbers of users. The proposed framework includes a residual vector-quantized variational autoencoder (RVQ-VAE) for efficient channel feedback and an edge graph attention network (EGAT) for robust multiuser precoding. It can adapt to different feedback capacities by flexibly adjusting the RVQ codebook sizes using the hierarchical codebook structure, and scale with the number of users through a feedback module sharing scheme and the inherent scalability of EGAT. Moreover, a progressive training strategy is proposed to further enhance data transmission performance and generalization capability. Numerical results on a real-world dataset demonstrate the superior scalability and performance of our approach over existing methods. △ Less

Submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.10911 [pdf, ps, other]

Low-Overhead Channel Estimation Framework for Beyond Diagonal Reconfigurable Intelligent Surface Assisted Multi-User MIMO Communication

Authors: Rui Wang, Shuowen Zhang, Bruno Clerckx, Liang Liu

Abstract: Beyond diagonal reconfigurable intelligent surface (BD-RIS) refers to a family of RIS architectures characterized by scattering matrices not limited to being diagonal and enables higher wave manipulation flexibility and large performance gains over conventional (diagonal) RIS. To achieve those promising gains, accurate channel state information (CSI) needs to be acquired in BD-RIS assisted communi… ▽ More Beyond diagonal reconfigurable intelligent surface (BD-RIS) refers to a family of RIS architectures characterized by scattering matrices not limited to being diagonal and enables higher wave manipulation flexibility and large performance gains over conventional (diagonal) RIS. To achieve those promising gains, accurate channel state information (CSI) needs to be acquired in BD-RIS assisted communication systems. However, the number of coefficients in the cascaded channels to be estimated in BD-RIS assisted systems is significantly larger than that in conventional RIS assisted systems, because the channels associated with the off-diagonal elements of the scattering matrix have to be estimated as well. Surprisingly, for the first time in the literature, this paper rigorously shows that the uplink channel estimation overhead in BD-RIS assisted systems is actually of the same order as that in the conventional RIS assisted systems. This amazing result stems from a key observation: for each user antenna, its cascaded channel matrix associated with one reference BD-RIS element is a scaled version of that associated with any other BD-RIS element due to the common RIS-base station (BS) channel. In other words, the number of independent unknown variables is far less than it would seem at first glance. Building upon this property, this paper manages to characterize the minimum overhead to perfectly estimate all the channels in the ideal case without noise at the BS, and propose a twophase estimation framework for the practical case with noise at the BS. Numerical results demonstrate outstanding channel estimation overhead reduction over existing schemes in BD-RIS assisted systems. △ Less

Submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.10060 [pdf, other]

Learning to Beamform for Cooperative Localization and Communication: A Link Heterogeneous GNN-Based Approach

Authors: Lixiang Lian, Chuanqi Bai, Yihan Xu, Huanyu Dong, Rui Cheng, Shunqing Zhang

Abstract: Integrated sensing and communication (ISAC) has emerged as a key enabler for next-generation wireless networks, supporting advanced applications such as high-precision localization and environment reconstruction. Cooperative ISAC (CoISAC) further enhances these capabilities by enabling multiple base stations (BSs) to jointly optimize communication and sensing performance through coordination. Howe… ▽ More Integrated sensing and communication (ISAC) has emerged as a key enabler for next-generation wireless networks, supporting advanced applications such as high-precision localization and environment reconstruction. Cooperative ISAC (CoISAC) further enhances these capabilities by enabling multiple base stations (BSs) to jointly optimize communication and sensing performance through coordination. However, CoISAC beamforming design faces significant challenges due to system heterogeneity, large-scale problem complexity, and sensitivity to parameter estimation errors. Traditional deep learning-based techniques fail to exploit the unique structural characteristics of CoISAC systems, thereby limiting their ability to enhance system performance. To address these challenges, we propose a Link-Heterogeneous Graph Neural Network (LHGNN) for joint beamforming in CoISAC systems. Unlike conventional approaches, LHGNN models communication and sensing links as heterogeneous nodes and their interactions as edges, enabling the capture of the heterogeneous nature and intricate interactions of CoISAC systems. Furthermore, a graph attention mechanism is incorporated to dynamically adjust node and link importance, improving robustness to channel and position estimation errors. Numerical results demonstrate that the proposed attention-enhanced LHGNN achieves superior communication rates while maintaining sensing accuracy under power constraints. The proposed method also exhibits strong robustness to communication channel and position estimation error. △ Less

Submitted 14 April, 2025; originally announced April 2025.

Showing 1–50 of 844 results for author: Zhang, S