Search | arXiv e-print repository

Experience-Centric Resource Management in ISAC Networks: A Digital Agent-Assisted Approach

Authors: Xinyu Huang, Yixiao Zhang, Yingying Pei, Jianzhe Xue, Xuemin Shen

Abstract: In this paper, we propose a digital agent (DA)-assisted resource management scheme for enhanced user quality of experience (QoE) in integrated sensing and communication (ISAC) networks. Particularly, user QoE is a comprehensive metric that integrates quality of service (QoS), user behavioral dynamics, and environmental complexity. The novel DA module includes a user status prediction model, a QoS… ▽ More In this paper, we propose a digital agent (DA)-assisted resource management scheme for enhanced user quality of experience (QoE) in integrated sensing and communication (ISAC) networks. Particularly, user QoE is a comprehensive metric that integrates quality of service (QoS), user behavioral dynamics, and environmental complexity. The novel DA module includes a user status prediction model, a QoS factor selection model, and a QoE fitting model, which analyzes historical user status data to construct and update user-specific QoE models. Users are clustered into different groups based on their QoE models. A Cramér-Rao bound (CRB) model is utilized to quantify the impact of allocated communication resources on sensing accuracy. A joint optimization problem of communication and computing resource management is formulated to maximize long-term user QoE while satisfying CRB and resource constraints. A two-layer data-model-driven algorithm is developed to solve the formulated problem, where the top layer utilizes an advanced deep reinforcement learning algorithm to make group-level decisions, and the bottom layer uses convex optimization techniques to make user-level decisions. Simulation results based on a real-world dataset demonstrate that the proposed DA-assisted resource management scheme outperforms benchmark schemes in terms of user QoE. △ Less

Submitted 8 July, 2025; originally announced July 2025.

arXiv:2507.05727 [pdf, ps, other]

ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

Authors: He Wang, Linhan Ma, Dake Guo, Xiong Wang, Lei Xie, Jin Xu, Junyang Lin

Abstract: Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and c… ▽ More Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly enhanced the visibility of general artificial intelligence capabilities. Consequently, there exists a compelling need for a benchmark that can evaluate both the generality and intelligence of ASR systems. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess contextual speech recognition. This benchmark encompasses up to 40,000 data entries across over 10 domains, enabling a thorough evaluation of model performance in scenarios that omit or incorporate coarse-grained or fine-grained contextual information. Moreover, diverging from conventional ASR evaluations, our benchmark includes an analysis of model efficacy in recognizing named entities mentioned within the auditory input. Our extensive evaluation highlights that LALMs, with strong world knowledge and context learning capabilities, outperform conventional ASR models by a large margin. The dataset and evaluation code have been released at https://github.com/MrSupW/ContextASR-Bench. △ Less

Submitted 8 July, 2025; originally announced July 2025.

Comments: 18 pages, 4 figures

arXiv:2507.04891 [pdf, ps, other]

MurreNet: Modeling Holistic Multimodal Interactions Between Histopathology and Genomic Profiles for Survival Prediction

Authors: Mingxin Liu, Chengfei Cai, Jun Li, Pengbo Xu, Jinze Li, Jiquan Ma, Jun Xu

Abstract: Cancer survival prediction requires integrating pathological Whole Slide Images (WSIs) and genomic profiles, a challenging task due to the inherent heterogeneity and the complexity of modeling both inter- and intra-modality interactions. Current methods often employ straightforward fusion strategies for multimodal feature integration, failing to comprehensively capture modality-specific and modali… ▽ More Cancer survival prediction requires integrating pathological Whole Slide Images (WSIs) and genomic profiles, a challenging task due to the inherent heterogeneity and the complexity of modeling both inter- and intra-modality interactions. Current methods often employ straightforward fusion strategies for multimodal feature integration, failing to comprehensively capture modality-specific and modality-common interactions, resulting in a limited understanding of multimodal correlations and suboptimal predictive performance. To mitigate these limitations, this paper presents a Multimodal Representation Decoupling Network (MurreNet) to advance cancer survival analysis. Specifically, we first propose a Multimodal Representation Decomposition (MRD) module to explicitly decompose paired input data into modality-specific and modality-shared representations, thereby reducing redundancy between modalities. Furthermore, the disentangled representations are further refined then updated through a novel training regularization strategy that imposes constraints on distributional similarity, difference, and representativeness of modality features. Finally, the augmented multimodal features are integrated into a joint representation via proposed Deep Holistic Orthogonal Fusion (DHOF) strategy. Extensive experiments conducted on six TCGA cancer cohorts demonstrate that our MurreNet achieves state-of-the-art (SOTA) performance in survival prediction. △ Less

Submitted 7 July, 2025; originally announced July 2025.

Comments: 11 pages, 2 figures, Accepted by MICCAI 2025

arXiv:2507.04622 [pdf, ps, other]

A Deep Unfolding Framework for Diffractive Snapshot Spectral Imaging

Authors: Zhengyue Zhuge, Jiahui Xu, Shiqi Chen, Hao Xu, Yueting Chen, Zhihai Xu, Huajun Feng

Abstract: Snapshot hyperspectral imaging systems acquire spectral data cubes through compressed sensing. Recently, diffractive snapshot spectral imaging (DSSI) methods have attracted significant attention. While various optical designs and improvements continue to emerge, research on reconstruction algorithms remains limited. Although numerous networks and deep unfolding methods have been applied on similar… ▽ More Snapshot hyperspectral imaging systems acquire spectral data cubes through compressed sensing. Recently, diffractive snapshot spectral imaging (DSSI) methods have attracted significant attention. While various optical designs and improvements continue to emerge, research on reconstruction algorithms remains limited. Although numerous networks and deep unfolding methods have been applied on similar tasks, they are not fully compatible with DSSI systems because of their distinct optical encoding mechanism. In this paper, we propose an efficient deep unfolding framework for diffractive systems, termed diffractive deep unfolding (DDU). Specifically, we derive an analytical solution for the data fidelity term in DSSI, ensuring both the efficiency and the effectiveness during the iterative reconstruction process. Given the severely ill-posed nature of the problem, we employ a network-based initialization strategy rather than non-learning-based methods or linear layers, leading to enhanced stability and performance. Our framework demonstrates strong compatibility with existing state-of-the-art (SOTA) models, which effectively address the initialization and prior subproblem. Extensive experiments validate the superiority of the proposed DDU framework, showcasing improved performance while maintaining comparable parameter counts and computational complexity. These results suggest that DDU provides a solid foundation for future unfolding-based methods in DSSI. △ Less

Submitted 6 July, 2025; originally announced July 2025.

arXiv:2507.03799 [pdf, ps, other]

On the Distribution of Age of Information in Time-varying Updating Systems

Authors: Jin Xu, Weiqi Wang, Natarajan Gautam

Abstract: Age of Information (AoI) is a crucial metric for quantifying information freshness in real-time systems where the sampling rate of data packets is time-varying. Evaluating AoI under such conditions is challenging, as system states become temporally correlated and traditional stationary analysis is inapplicable. We investigate an $M_{t}/G/1/1$ queueing system with a time-varying sampling rate and p… ▽ More Age of Information (AoI) is a crucial metric for quantifying information freshness in real-time systems where the sampling rate of data packets is time-varying. Evaluating AoI under such conditions is challenging, as system states become temporally correlated and traditional stationary analysis is inapplicable. We investigate an $M_{t}/G/1/1$ queueing system with a time-varying sampling rate and probabilistic preemption, proposing a novel analytical framework based on multi-dimensional partial differential equations (PDEs) to capture the time evolution of the system's status distribution. To solve the PDEs, we develop a decomposition technique that breaks the high-dimensional PDE into lower-dimensional subsystems. Solving these subsystems allows us to derive the Aol distribution at arbitrary time instances. We show AoI does not exhibit a memoryless property, even with negligible processing times, due to its dependence on the historical sampling process. Our framework extends to the stationary setting, where we derive a closed-form expression for the Laplace-Stieltjes Transform (LST) of the steady-state AoI. Numerical experiments reveal AoI exhibits a non-trivial lag in response to sampling rate changes. Our results also show that no single preemption probability or processing time distribution can minimize Aol violation probability across all thresholds in either time-varying or stationary scenarios. Finally, we formulate an optimization problem and propose a heuristic method to find sampling rates that reduce costs while satisfying AoI constraints. △ Less

Submitted 4 July, 2025; originally announced July 2025.

Comments: 32 pages, 10 figures

arXiv:2507.03043 [pdf, ps, other]

K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

Authors: Shuhe Li, Chenxu Guo, Jiachen Lian, Cheol Jun Cho, Wenshuo Zhao, Xuanru Zhou, Dingkun Zhou, Sam Wang, Grace Wang, Jingze Yang, Jingyi Xu, Ruohan Bao, Elise Brenner, Brandon In, Francesca Pei, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

Abstract: Early evaluation of children's language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specifi… ▽ More Early evaluation of children's language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Kids-WFST attains 1.39% phoneme error on MyST and 8.61% on Multitudes--absolute gains of 10.47 and 7.06 points over a greedy-search decoder. These high-fidelity transcripts power an LLM that grades verbal skills, milestones, reading, and comprehension, aligning with human proctors and supplying tongue-and-lip visualizations plus targeted advice. The results show that precise phoneme recognition cements a complete diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment. △ Less

Submitted 3 July, 2025; originally announced July 2025.

arXiv:2507.00993 [pdf, ps, other]

Advancing Lung Disease Diagnosis in 3D CT Scans

Authors: Qingqiu Li, Runtian Yuan, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen

Abstract: To enable more accurate diagnosis of lung disease in chest CT scans, we propose a straightforward yet effective model. Firstly, we analyze the characteristics of 3D CT scans and remove non-lung regions, which helps the model focus on lesion-related areas and reduces computational cost. We adopt ResNeSt50 as a strong feature extractor, and use a weighted cross-entropy loss to mitigate class imbalan… ▽ More To enable more accurate diagnosis of lung disease in chest CT scans, we propose a straightforward yet effective model. Firstly, we analyze the characteristics of 3D CT scans and remove non-lung regions, which helps the model focus on lesion-related areas and reduces computational cost. We adopt ResNeSt50 as a strong feature extractor, and use a weighted cross-entropy loss to mitigate class imbalance, especially for the underrepresented squamous cell carcinoma category. Our model achieves a Macro F1 Score of 0.80 on the validation set of the Fair Disease Diagnosis Challenge, demonstrating its strong performance in distinguishing between different lung conditions. △ Less

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2506.23688 [pdf, ps, other]

GUSL: A Novel and Efficient Machine Learning Model for Prostate Segmentation on MRI

Authors: Jiaxin Yang, Vasileios Magoulianitis, Catherine Aurelia Christie Alexander, Jintang Xue, Masatomo Kaneko, Giovanni Cacciamani, Andre Abreu, Vinay Duddalwar, C. -C. Jay Kuo, Inderbir S. Gill, Chrysostomos Nikias

Abstract: Prostate and zonal segmentation is a crucial step for clinical diagnosis of prostate cancer (PCa). Computer-aided diagnosis tools for prostate segmentation are based on the deep learning (DL) paradigm. However, deep neural networks are perceived as "black-box" solutions by physicians, thus making them less practical for deployment in the clinical setting. In this paper, we introduce a feed-forward… ▽ More Prostate and zonal segmentation is a crucial step for clinical diagnosis of prostate cancer (PCa). Computer-aided diagnosis tools for prostate segmentation are based on the deep learning (DL) paradigm. However, deep neural networks are perceived as "black-box" solutions by physicians, thus making them less practical for deployment in the clinical setting. In this paper, we introduce a feed-forward machine learning model, named Green U-shaped Learning (GUSL), suitable for medical image segmentation without backpropagation. GUSL introduces a multi-layer regression scheme for coarse-to-fine segmentation. Its feature extraction is based on a linear model, which enables seamless interpretability during feature extraction. Also, GUSL introduces a mechanism for attention on the prostate boundaries, which is an error-prone region, by employing regression to refine the predictions through residue correction. In addition, a two-step pipeline approach is used to mitigate the class imbalance, an issue inherent in medical imaging problems. After conducting experiments on two publicly available datasets and one private dataset, in both prostate gland and zonal segmentation tasks, GUSL achieves state-of-the-art performance among other DL-based models. Notably, GUSL features a very energy-efficient pipeline, since it has a model size several times smaller and less complexity than the rest of the solutions. In all datasets, GUSL achieved a Dice Similarity Coefficient (DSC) performance greater than $0.9$ for gland segmentation. Considering also its lightweight model size and transparency in feature extraction, it offers a competitive and practical package for medical imaging applications. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23584 [pdf, ps, other]

A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation

Authors: Renjie Liang, Zhengkang Fan, Jinqian Pan, Chenkun Sun, Russell Terry, Jie Xu

Abstract: Generating radiology reports from CT scans remains a complex task due to the nuanced nature of medical imaging and the variability in clinical documentation. In this study, we propose a two-stage framework for generating renal radiology reports from 2D CT slices. First, we extract structured abnormality features using a multi-task learning model trained to identify lesion attributes such as locati… ▽ More Generating radiology reports from CT scans remains a complex task due to the nuanced nature of medical imaging and the variability in clinical documentation. In this study, we propose a two-stage framework for generating renal radiology reports from 2D CT slices. First, we extract structured abnormality features using a multi-task learning model trained to identify lesion attributes such as location, size, enhancement, and attenuation. These extracted features are subsequently combined with the corresponding CT image and fed into a fine-tuned vision-language model to generate natural language report sentences aligned with clinical findings. We conduct experiments on a curated dataset of renal CT studies with manually annotated sentence-slice-feature triplets and evaluate performance using both classification metrics and natural language generation metrics. Our results demonstrate that the proposed model outperforms random baselines across all abnormality types, and the generated reports capture key clinical content with reasonable textual accuracy. This exploratory work highlights the feasibility of modular, feature-informed report generation for renal imaging. Future efforts will focus on extending this pipeline to 3D CT volumes and further improving clinical fidelity in multimodal medical AI systems. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23490 [pdf, ps, other]

UltraTwin: Towards Cardiac Anatomical Twin Generation from Multi-view 2D Ultrasound

Authors: Junxuan Yu, Yaofei Duan, Yuhao Huang, Yu Wang, Rongbo Ling, Weihao Luo, Ang Zhang, Jingxian Xu, Qiongying Ni, Yongsong Zhou, Binghan Li, Haoran Dou, Liping Liu, Yanfen Chu, Feng Geng, Zhe Sheng, Zhifeng Ding, Dingxin Zhang, Rui Huang, Yuhang Zhang, Xiaowei Xu, Tao Tan, Dong Ni, Zhongshan Gou, Xin Yang

Abstract: Echocardiography is routine for cardiac examination. However, 2D ultrasound (US) struggles with accurate metric calculation and direct observation of 3D cardiac structures. Moreover, 3D US is limited by low resolution, small field of view and scarce availability in practice. Constructing the cardiac anatomical twin from 2D images is promising to provide precise treatment planning and clinical quan… ▽ More Echocardiography is routine for cardiac examination. However, 2D ultrasound (US) struggles with accurate metric calculation and direct observation of 3D cardiac structures. Moreover, 3D US is limited by low resolution, small field of view and scarce availability in practice. Constructing the cardiac anatomical twin from 2D images is promising to provide precise treatment planning and clinical quantification. However, it remains challenging due to the rare paired data, complex structures, and US noises. In this study, we introduce a novel generative framework UltraTwin, to obtain cardiac anatomical twin from sparse multi-view 2D US. Our contribution is three-fold. First, pioneered the construction of a real-world and high-quality dataset containing strictly paired multi-view 2D US and CT, and pseudo-paired data. Second, we propose a coarse-to-fine scheme to achieve hierarchical reconstruction optimization. Last, we introduce an implicit autoencoder for topology-aware constraints. Extensive experiments show that UltraTwin reconstructs high-quality anatomical twins versus strong competitors. We believe it advances anatomical twin modeling for potential applications in personalized cardiac care. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: accepted by miccai 2025

arXiv:2506.23208 [pdf, ps, other]

Multi-Source COVID-19 Detection via Variance Risk Extrapolation

Authors: Runtian Yuan, Qingqiu Li, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen

Abstract: We present our solution for the Multi-Source COVID-19 Detection Challenge, which aims to classify chest CT scans into COVID and Non-COVID categories across data collected from four distinct hospitals and medical centers. A major challenge in this task lies in the domain shift caused by variations in imaging protocols, scanners, and patient populations across institutions. To enhance the cross-doma… ▽ More We present our solution for the Multi-Source COVID-19 Detection Challenge, which aims to classify chest CT scans into COVID and Non-COVID categories across data collected from four distinct hospitals and medical centers. A major challenge in this task lies in the domain shift caused by variations in imaging protocols, scanners, and patient populations across institutions. To enhance the cross-domain generalization of our model, we incorporate Variance Risk Extrapolation (VREx) into the training process. VREx encourages the model to maintain consistent performance across multiple source domains by explicitly minimizing the variance of empirical risks across environments. This regularization strategy reduces overfitting to center-specific features and promotes learning of domain-invariant representations. We further apply Mixup data augmentation to improve generalization and robustness. Mixup interpolates both the inputs and labels of randomly selected pairs of training samples, encouraging the model to behave linearly between examples and enhancing its resilience to noise and limited data. Our method achieves an average macro F1 score of 0.96 across the four sources on the validation set, demonstrating strong generalization. △ Less

Submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.21803 [pdf, ps, other]

From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining

Authors: Fuying Wang, Jiacheng Xu, Lequan Yu

Abstract: Electrocardiograms (ECGs) play a vital role in monitoring cardiac health and diagnosing heart diseases. However, traditional deep learning approaches for ECG analysis rely heavily on large-scale manual annotations, which are both time-consuming and resource-intensive to obtain. To overcome this limitation, self-supervised learning (SSL) has emerged as a promising alternative, enabling the extracti… ▽ More Electrocardiograms (ECGs) play a vital role in monitoring cardiac health and diagnosing heart diseases. However, traditional deep learning approaches for ECG analysis rely heavily on large-scale manual annotations, which are both time-consuming and resource-intensive to obtain. To overcome this limitation, self-supervised learning (SSL) has emerged as a promising alternative, enabling the extraction of robust ECG representations that can be efficiently transferred to various downstream tasks. While previous studies have explored SSL for ECG pretraining and multi-modal ECG-language alignment, they often fail to capture the multi-scale nature of ECG signals. As a result, these methods struggle to learn generalized representations due to their inability to model the hierarchical structure of ECG data. To address this gap, we introduce MELP, a novel Multi-scale ECG-Language Pretraining (MELP) model that fully leverages hierarchical supervision from ECG-text pairs. MELP first pretrains a cardiology-specific language model to enhance its understanding of clinical text. It then applies three levels of cross-modal supervision-at the token, beat, and rhythm levels-to align ECG signals with textual reports, capturing structured information across different time scales. We evaluate MELP on three public ECG datasets across multiple tasks, including zero-shot ECG classification, linear probing, and transfer learning. Experimental results demonstrate that MELP outperforms existing SSL methods, underscoring its effectiveness and adaptability across diverse clinical applications. Our code is available at https://github.com/HKU-MedAI/MELP. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: ICML 2025

arXiv:2506.21690 [pdf, ps, other]

Joint RIS-UE Association and Beamforming Design in RIS-Assisted Cell-Free MIMO Network

Authors: Hongqin Ke, Jindan Xu, Wei Xu, Chau Yuen, Zhaohua Lu

Abstract: Reconfigurable intelligent surface (RIS)-assisted cell-free (CF) multiple-input multiple-output (MIMO) networks can significantly enhance system performance. However, the extensive deployment of RIS elements imposes considerable channel acquisition overhead, with the high density of nodes and antennas in RIS-assisted CF networks amplifying this challenge. To tackle this issue, in this paper, we ex… ▽ More Reconfigurable intelligent surface (RIS)-assisted cell-free (CF) multiple-input multiple-output (MIMO) networks can significantly enhance system performance. However, the extensive deployment of RIS elements imposes considerable channel acquisition overhead, with the high density of nodes and antennas in RIS-assisted CF networks amplifying this challenge. To tackle this issue, in this paper, we explore integrating RIS-user equipment (UE) association into downlink RIS-assisted CF transmitter design, which greatly reduces the channel acquisition costs. The key point is that once UEs are associated with specific RISs, there is no need to frequently acquire channels from non-associated RISs. Then, we formulate the problem of joint RIS-UE association and beamforming at APs and RISs to maximize the weighted sum rate (WSR). In particular, we propose a two-stage framework to solve it. In the first stage, we apply a many-to-many matching algorithm to establish the RIS-UE association. In the second stage, we introduce a sequential optimization-based method that decomposes the joint optimization of RIS phase shifts and AP beamforming into two distinct subproblems. To optimize the RIS phase shifts, we employ the majorization-minimization (MM) algorithm to obtain a semi-closed-form solution. For AP beamforming, we develop a joint block diagonalization algorithm, which yields a closed-form solution. Simulation results demonstrate the effectiveness of the proposed algorithm and show that, while RIS-UE association significantly reduces overhead, it incurs a minor performance loss that remains within an acceptable range. Additionally, we investigate the impact of RIS deployment and conclude that RISs exhibit enhanced performance when positioned between APs and UEs. △ Less

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.15082 [pdf, ps, other]

Make Your AUV Adaptive: An Environment-Aware Reinforcement Learning Framework For Underwater Tasks

Authors: Yimian Ding, Jingzehua Xu, Guanwen Xie, Shuai Zhang, Yi Li

Abstract: This study presents a novel environment-aware reinforcement learning (RL) framework designed to augment the operational capabilities of autonomous underwater vehicles (AUVs) in underwater environments. Departing from traditional RL architectures, the proposed framework integrates an environment-aware network module that dynamically captures flow field data, effectively embedding this critical envi… ▽ More This study presents a novel environment-aware reinforcement learning (RL) framework designed to augment the operational capabilities of autonomous underwater vehicles (AUVs) in underwater environments. Departing from traditional RL architectures, the proposed framework integrates an environment-aware network module that dynamically captures flow field data, effectively embedding this critical environmental information into the state space. This integration facilitates real-time environmental adaptation, significantly enhancing the AUV's situational awareness and decision-making capabilities. Furthermore, the framework incorporates AUV structure characteristics into the optimization process, employing a large language model (LLM)-based iterative refinement mechanism that leverages both environmental conditions and training outcomes to optimize task performance. Comprehensive experimental evaluations demonstrate the framework's superior performance, robustness and adaptability. △ Less

Submitted 17 June, 2025; originally announced June 2025.

Comments: This paper has been accepted by IROS 2025

arXiv:2506.12475 [pdf, ps, other]

Efficient Star Distillation Attention Network for Lightweight Image Super-Resolution

Authors: Fangwei Hao, Ji Du, Desheng Kong, Jiesheng Wu, Jing Xu, Ping Li

Abstract: In recent years, the performance of lightweight Single-Image Super-Resolution (SISR) has been improved significantly with the application of Convolutional Neural Networks (CNNs) and Large Kernel Attention (LKA). However, existing information distillation modules for lightweight SISR struggle to map inputs into High-Dimensional Non-Linear (HDNL) feature spaces, limiting their representation learnin… ▽ More In recent years, the performance of lightweight Single-Image Super-Resolution (SISR) has been improved significantly with the application of Convolutional Neural Networks (CNNs) and Large Kernel Attention (LKA). However, existing information distillation modules for lightweight SISR struggle to map inputs into High-Dimensional Non-Linear (HDNL) feature spaces, limiting their representation learning. And their LKA modules possess restricted ability to capture the multi-shape multi-scale information for long-range dependencies while encountering a quadratic increase in the computational burden with increasing convolutional kernel size of its depth-wise convolutional layer. To address these issues, we firstly propose a Star Distillation Module (SDM) to enhance the discriminative representation learning via information distillation in the HDNL feature spaces. Besides, we present a Multi-shape Multi-scale Large Kernel Attention (MM-LKA) module to learn representative long-range dependencies while incurring low computational and memory footprints, leading to improving the performance of CNN-based self-attention significantly. Integrating SDM and MM-LKA, we develop a Residual Star Distillation Attention Module (RSDAM) and take it as the building block of the proposed efficient Star Distillation Attention Network (SDAN) which possesses high reconstruction efficiency to recover a higher-quality image from the corresponding low-resolution (LR) counterpart. When compared with other lightweight state-of-the-art SISR methods, extensive experiments show that our SDAN with low model complexity yields superior performance quantitatively and visually. △ Less

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.12308 [pdf, ps, other]

From Ground to Sky: Architectures, Applications, and Challenges Shaping Low-Altitude Wireless Networks

Authors: Weijie Yuan, Yuanhao Cui, Jiacheng Wang, Fan Liu, Geng Sun, Tao Xiang, Jie Xu, Shi Jin, Dusit Niyato, Sinem Coleri, Sumei Sun, Shiwen Mao, Abbas Jamalipour, Dong In Kim, Mohamed-Slim Alouini, Xuemin Shen

Abstract: In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communica… ▽ More In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communication systems, LAWN's distinctive feature is its tight integration of functional planes in which multiple functionalities continually reshape themselves to operate safely and efficiently in the low-altitude sky. With the LAWN, we discuss several enabling technologies, such as integrated sensing and communication (ISAC), semantic communication, and fully-actuated control systems. Finally, we identify potential applications and key cross-layer challenges. This article offers a comprehensive roadmap for future research and development in the low-altitude airspace. △ Less

Submitted 16 June, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

Comments: 10 pages, 5 figures

arXiv:2506.11293 [pdf, ps, other]

Influence Functions for Data Attribution in Linear System Identification and LQR Control

Authors: Jiachen Li, Shihao Li, Jiamin Xu, Soovadeep Bakshi, Dongmei Chen

Abstract: Understanding the influence of individual training data points is crucial for developing reliable machine learning-based control systems. However, conventional methods like leave-one-out retraining are computationally infeasible for large datasets. This paper introduces a framework using influence functions to efficiently approximate the impact of removing specific training trajectories on both le… ▽ More Understanding the influence of individual training data points is crucial for developing reliable machine learning-based control systems. However, conventional methods like leave-one-out retraining are computationally infeasible for large datasets. This paper introduces a framework using influence functions to efficiently approximate the impact of removing specific training trajectories on both learned system dynamics and downstream control performance. We formulate two influence functions(IF): IF1, which estimates the effect on the predictive accuracy of a learned linear dynamics model, and IF2, which quantifies the subsequent impact on the cost of a Linear Quadratic Regulator (LQR) controller designed using these learned dynamics. These involve tracing sensitivities through the Discrete Algebraic Riccati Equation (DARE) solution. We empirically validate our approach on simulated linear systems analogous to robotic manipulators. Results show strong positive correlations between influence predictions and ground truth changes obtained via retraining. Our framework provides a computationally tractable method for data attribution. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10815

Joint Beamforming with Extremely Large Scale RIS: A Sequential Multi-Agent A2C Approach

Authors: Zhi Chai, Jiajie Xu, Justin P Coon, Mohamed-Slim Alouini

Abstract: It is a challenging problem to jointly optimize the base station (BS) precoding matrix and the reconfigurable intelligent surface (RIS) phases simultaneously in a RIS-assisted multiple-user multiple-input-multiple-output (MU-MIMO) scenario when the size of the RIS becomes extremely large. In this paper, we propose a deep reinforcement learning algorithm called sequential multi-agent advantage acto… ▽ More It is a challenging problem to jointly optimize the base station (BS) precoding matrix and the reconfigurable intelligent surface (RIS) phases simultaneously in a RIS-assisted multiple-user multiple-input-multiple-output (MU-MIMO) scenario when the size of the RIS becomes extremely large. In this paper, we propose a deep reinforcement learning algorithm called sequential multi-agent advantage actor-critic (A2C) to solve this problem. In addition, the discrete phase of RISs, imperfect channel state information (CSI), and channel correlations between users are taken into consideration. The computational complexity is also analyzed, and the performance of the proposed algorithm is compared with the zero-forcing (ZF) beamformer in terms of the sum spectral efficiency (SE). It is noted that the computational complexity of the proposed algorithm is lower than the benchmark, while the performance is better than the benchmark. Throughout simulations, it is also found that the proposed algorithm is robust to medium channel estimation error. △ Less

Submitted 13 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

Comments: There are some flaws that need to be figured out

arXiv:2506.09876 [pdf, ps, other]

Aucamp: An Underwater Camera-Based Multi-Robot Platform with Low-Cost, Distributed, and Robust Localization

Authors: Jisheng Xu, Ding Lin, Pangkit Fong, Chongrong Fang, Xiaoming Duan, Jianping He

Abstract: This paper introduces an underwater multi-robot platform, named Aucamp, characterized by cost-effective monocular-camera-based sensing, distributed protocol and robust orientation control for localization. We utilize the clarity feature to measure the distance, present the monocular imaging model, and estimate the position of the target object. We achieve global positioning in our platform by desi… ▽ More This paper introduces an underwater multi-robot platform, named Aucamp, characterized by cost-effective monocular-camera-based sensing, distributed protocol and robust orientation control for localization. We utilize the clarity feature to measure the distance, present the monocular imaging model, and estimate the position of the target object. We achieve global positioning in our platform by designing a distributed update protocol. The distributed algorithm enables the perception process to simultaneously cover a broader range, and greatly improves the accuracy and robustness of the positioning. Moreover, the explicit dynamics model of the robot in our platform is obtained, based on which, we propose a robust orientation control framework. The control system ensures that the platform maintains a balanced posture for each robot, thereby ensuring the stability of the localization system. The platform can swiftly recover from an forced unstable state to a stable horizontal posture. Additionally, we conduct extensive experiments and application scenarios to evaluate the performance of our platform. The proposed new platform may provide support for extensive marine exploration by underwater sensor networks. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09175 [pdf, ps, other]

PHRASED: Phrase Dictionary Biasing for Speech Translation

Authors: Peidong Wang, Jian Xue, Rui Zhao, Junkun Chen, Aswin Shanmugam Subramanian, Jinyu Li

Abstract: Phrases are essential to understand the core concepts in conversations. However, due to their rare occurrence in training data, correct translation of phrases is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to t… ▽ More Phrases are essential to understand the core concepts in conversations. However, due to their rare occurrence in training data, correct translation of phrases is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to two types of widely adopted models, a transducer-based streaming speech translation model and a multimodal large language model. Experimental results show that the phrase dictionary biasing method outperforms phrase list biasing by 21% relatively for the streaming speech translation model. In addition, phrase dictionary biasing enables multimodal large language models to use external phrase information, achieving 85% relative improvement in phrase recall. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.04748 [pdf, ps, other]

Synchro-Thermography: Monitoring ~10 mK Facial Temperature Changes with Heartbeat Referencing for Physiological Sensing

Authors: Nanami Kotani, Kuniharu Sakurada, Jiayi Xu, Masahiko Inami, Yasuaki Monnai

Abstract: Infrared thermography has gained interest as a tool for non-contact measurement of blood circulation and skin blood flow due to cardiac activity. Partiularly, blood vessels on the surface, such as on the back of the hand, are suited for visualization. However, standardized methodologies have not yet been established for areas such as the face and neck, where many blood vessels are lie deeper benea… ▽ More Infrared thermography has gained interest as a tool for non-contact measurement of blood circulation and skin blood flow due to cardiac activity. Partiularly, blood vessels on the surface, such as on the back of the hand, are suited for visualization. However, standardized methodologies have not yet been established for areas such as the face and neck, where many blood vessels are lie deeper beneath the surface, and external stimulation for measurement could be harmful. Here we propose Synchro-Thermography for stable monitoring of facial temperature changes associated with heart rate variability. We conducted experiments with eight subjects and measured minute temperature changes with an amplitude of about \SI{10}{mK} on the forehead and chin. The proposed method improves the temperature resolution by a factor of 2 or more, and can stably measure skin temperature changes caused by blood flow. This skin temperature change could be applied to physiological sensing such as blood flow changes due to injury or disease, or as an indicator of stress. △ Less

Submitted 5 June, 2025; originally announced June 2025.

Comments: This paper has been submitted to the 2025 SICE Festival with Annual Conference (SICE FES 2025)

arXiv:2506.02401 [pdf, ps, other]

Trusted Fake Audio Detection Based on Dirichlet Distribution

Authors: Chi Ding, Junxiao Xue, Cong Wang, Hao Zhou

Abstract: With the continuous development of deep learning-based speech conversion and speech synthesis technologies, the cybersecurity problem posed by fake audio has become increasingly serious. Previously proposed models for defending against fake audio have attained remarkable performance. However, they all fall short in modeling the trustworthiness of the decisions made by the models themselves. Based… ▽ More With the continuous development of deep learning-based speech conversion and speech synthesis technologies, the cybersecurity problem posed by fake audio has become increasingly serious. Previously proposed models for defending against fake audio have attained remarkable performance. However, they all fall short in modeling the trustworthiness of the decisions made by the models themselves. Based on this, we put forward a plausible fake audio detection approach based on the Dirichlet distribution with the aim of enhancing the reliability of fake audio detection. Specifically, we first generate evidence through a neural network. Uncertainty is then modeled using the Dirichlet distribution. By modeling the belief distribution with the parameters of the Dirichlet distribution, an estimate of uncertainty can be obtained for each decision. Finally, the predicted probabilities and corresponding uncertainty estimates are combined to form the final opinion. On the ASVspoof series dataset (i.e., ASVspoof 2019 LA, ASVspoof 2021 LA, and DF), we conduct a number of comparison experiments to verify the excellent performance of the proposed model in terms of accuracy, robustness, and trustworthiness. △ Less

Submitted 2 June, 2025; originally announced June 2025.

arXiv:2506.01759 [pdf, ps, other]

ADEPT: Adaptive Diffusion Environment for Policy Transfer Sim-to-Real

Authors: Youwei Yu, Junhong Xu, Lantao Liu

Abstract: Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured environments. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently… ▽ More Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured environments. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently challenging yet attainable environments to facilitate continuous policy improvement. Existing methods of outdoor environment generation often rely on heuristics constrained by a set of parameters, limiting the diversity and realism. In this work, we introduce ADEPT, a novel \textbf{A}daptive \textbf{D}iffusion \textbf{E}nvironment for \textbf{P}olicy \textbf{T}ransfer in the zero-shot sim-to-real fashion that leverages Denoising Diffusion Probabilistic Models to dynamically expand existing training environments by adding more diverse and complex environments adaptive to the current policy. ADEPT guides the diffusion model's generation process through initial noise optimization, blending noise-corrupted environments from existing training environments weighted by the policy's performance in each corresponding environment. By manipulating the noise corruption level, ADEPT seamlessly transitions between generating similar environments for policy fine-tuning and novel ones to expand training diversity. To benchmark ADEPT in off-road navigation, we propose a fast and effective multi-layer map representation for wild environment generation. Our experiments show that the policy trained by ADEPT outperforms both procedural generated and natural environments, along with popular navigation methods. △ Less

Submitted 4 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

Comments: arXiv admin note: substantial text overlap with arXiv:2410.10766

arXiv:2506.00740 [pdf, ps, other]

Length Aware Speech Translation for Video Dubbing

Authors: Harveen Singh Chadha, Aswin Shanmugam Subramanian, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li

Abstract: In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths short, normal, and long using predefined tags. Additionally, we i… ▽ More In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths short, normal, and long using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained comparable BLEU scores compared to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving a mean opinion score (MOS) gain of 0.34 for Spanish and 0.65 for Korean, respectively. △ Less

Submitted 31 May, 2025; originally announced June 2025.

Comments: This paper was accepted to Interspeech 2025

arXiv:2506.00350 [pdf, ps, other]

DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model

Authors: Xueyuan Chen, Dongchao Yang, Wenxuan Wu, Minglin Wu, Jing Xu, Xixin Wu, Zhiyong Wu, Helen Meng

Abstract: Dysarthric speech reconstruction (DSR) aims to convert dysarthric speech into comprehensible speech while maintaining the speaker's identity. Despite significant advancements, existing methods often struggle with low speech intelligibility and poor speaker similarity. In this study, we introduce a novel diffusion-based DSR system that leverages a latent diffusion model to enhance the quality of sp… ▽ More Dysarthric speech reconstruction (DSR) aims to convert dysarthric speech into comprehensible speech while maintaining the speaker's identity. Despite significant advancements, existing methods often struggle with low speech intelligibility and poor speaker similarity. In this study, we introduce a novel diffusion-based DSR system that leverages a latent diffusion model to enhance the quality of speech reconstruction. Our model comprises: (i) a speech content encoder for phoneme embedding restoration via pre-trained self-supervised learning (SSL) speech foundation models; (ii) a speaker identity encoder for speaker-aware identity preservation by in-context learning mechanism; (iii) a diffusion-based speech generator to reconstruct the speech based on the restored phoneme embedding and preserved speaker identity. Through evaluations on the widely-used UASpeech corpus, our proposed model shows notable enhancements in speech intelligibility and speaker similarity. △ Less

Submitted 30 May, 2025; originally announced June 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.24496 [pdf, other]

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation

Authors: Wenrui Liu, Qian Chen, Wen Wang, Yafeng Chen, Jin Xu, Zhifang Guo, Guanrou Yang, Weiqin Li, Xiaoda Yang, Tao Jin, Minghui Fang, Jialong Zuo, Bai Jionghao, Zemin Liu

Abstract: Neural audio codecs, used as speech tokenizers, have demonstrated remarkable potential in the field of speech generation. However, to ensure high-fidelity audio reconstruction, neural audio codecs typically encode audio into long sequences of speech tokens, posing a significant challenge for downstream language models in long-context modeling. We observe that speech token sequences exhibit short-r… ▽ More Neural audio codecs, used as speech tokenizers, have demonstrated remarkable potential in the field of speech generation. However, to ensure high-fidelity audio reconstruction, neural audio codecs typically encode audio into long sequences of speech tokens, posing a significant challenge for downstream language models in long-context modeling. We observe that speech token sequences exhibit short-range dependency: due to the monotonic alignment between text and speech in text-to-speech (TTS) tasks, the prediction of the current token primarily relies on its local context, while long-range tokens contribute less to the current token prediction and often contain redundant information. Inspired by this observation, we propose a \textbf{compressed-to-fine language modeling} approach to address the challenge of long sequence speech tokens within neural codec language models: (1) \textbf{Fine-grained Initial and Short-range Information}: Our approach retains the prompt and local tokens during prediction to ensure text alignment and the integrity of paralinguistic information; (2) \textbf{Compressed Long-range Context}: Our approach compresses long-range token spans into compact representations to reduce redundant information while preserving essential semantics. Extensive experiments on various neural audio codecs and downstream language models validate the effectiveness and generalizability of the proposed approach, highlighting the importance of token compression in improving speech generation within neural codec language models. The demo of audio samples will be available at https://anonymous.4open.science/r/SpeechTokenPredictionViaCompressedToFinedLM. △ Less

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.23821 [pdf, ps, other]

SpeechVerifier: Robust Acoustic Fingerprint against Tampering Attacks via Watermarking

Authors: Lingfeng Yao, Chenpei Huang, Shengyao Wang, Junpei Xue, Hanqing Guo, Jiang Liu, Xun Chen, Miao Pan

Abstract: With the surge of social media, maliciously tampered public speeches, especially those from influential figures, have seriously affected social stability and public trust. Existing speech tampering detection methods remain insufficient: they either rely on external reference data or fail to be both sensitive to attacks and robust to benign operations, such as compression and resampling. To tackle… ▽ More With the surge of social media, maliciously tampered public speeches, especially those from influential figures, have seriously affected social stability and public trust. Existing speech tampering detection methods remain insufficient: they either rely on external reference data or fail to be both sensitive to attacks and robust to benign operations, such as compression and resampling. To tackle these challenges, we introduce SpeechVerifer to proactively verify speech integrity using only the published speech itself, i.e., without requiring any external references. Inspired by audio fingerprinting and watermarking, SpeechVerifier can (i) effectively detect tampering attacks, (ii) be robust to benign operations and (iii) verify the integrity only based on published speeches. Briefly, SpeechVerifier utilizes multiscale feature extraction to capture speech features across different temporal resolutions. Then, it employs contrastive learning to generate fingerprints that can detect modifications at varying granularities. These fingerprints are designed to be robust to benign operations, but exhibit significant changes when malicious tampering occurs. To enable speech verification in a self-contained manner, the generated fingerprints are then embedded into the speech signal by segment-wise watermarking. Without external references, SpeechVerifier can retrieve the fingerprint from the published audio and check it with the embedded watermark to verify the integrity of the speech. Extensive experimental results demonstrate that the proposed SpeechVerifier is effective in detecting tampering attacks and robust to benign operations. △ Less

Submitted 1 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.22438 [pdf, ps, other]

Synonymous Variational Inference for Perceptual Image Compression

Authors: Zijian Liang, Kai Niu, Changshuo Wang, Jin Xu, Ping Zhang

Abstract: Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criteri… ▽ More Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criterion to build an ideal synonymous set (Synset), and approximate the posterior of its latent synonymous representation with a parametric density by minimizing a partial semantic KL divergence. This analysis theoretically proves that the optimization direction of perception image compression follows a triple tradeoff that can cover the existing rate-distortion-perception schemes. Additionally, we introduce synonymous image compression (SIC), a new image compression scheme that corresponds to the analytical process of SVI, and implement a progressive SIC codec to fully leverage the model's capabilities. Experimental results demonstrate comparable rate-distortion-perception performance using a single progressive SIC codec, thus verifying the effectiveness of our proposed analysis method. △ Less

Submitted 28 May, 2025; originally announced May 2025.

Comments: 31 pages, 20 figures. This paper is accepted by Proceedings of the 42nd International Conference on Machine Learning (ICML 2025) Poster

arXiv:2505.22343 [pdf, ps, other]

Empowering Intelligent Low-altitude Economy with Large AI Model Deployment

Authors: Zhonghao Lyu, Yulan Gao, Junting Chen, Hongyang Du, Jie Xu, Kaibin Huang, Dong In Kim

Abstract: Low-altitude economy (LAE) represents an emerging economic paradigm that redefines commercial and social aerial activities. Large artificial intelligence models (LAIMs) offer transformative potential to further enhance the intelligence of LAE services. However, deploying LAIMs in LAE poses several challenges, including the significant gap between their computational/storage demands and the limited… ▽ More Low-altitude economy (LAE) represents an emerging economic paradigm that redefines commercial and social aerial activities. Large artificial intelligence models (LAIMs) offer transformative potential to further enhance the intelligence of LAE services. However, deploying LAIMs in LAE poses several challenges, including the significant gap between their computational/storage demands and the limited onboard resources of LAE entities, the mismatch between lab-trained LAIMs and dynamic physical environments, and the inefficiencies of traditional decoupled designs for sensing, communication, and computation. To address these issues, we first propose a hierarchical system architecture tailored for LAIM deployment and present representative LAE application scenarios. Next, we explore key enabling techniques that facilitate the mutual co-evolution of LAIMs and low-altitude systems, and introduce a task-oriented execution pipeline for scalable and adaptive service delivery. Then, the proposed framework is validated through real-world case studies. Finally, we outline open challenges to inspire future research. △ Less

Submitted 3 July, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.21894 [pdf, ps, other]

Patch-based Reconstruction for Unsupervised Dynamic MRI using Learnable Tensor Function with Implicit Neural Representation

Authors: Yuanyuan Liu, Yuanbiao Yang, Zhuo-Xu Cui, Qingyong Zhu, Jing Cheng, Congcong Liu, Jinwen Xie, Jingran Xu, Hairong Zheng, Dong Liang, Yanjie Zhu

Abstract: Dynamic MRI plays a vital role in clinical practice by capturing both spatial details and dynamic motion, but its high spatiotemporal resolution is often limited by long scan times. Deep learning (DL)-based methods have shown promising performance in accelerating dynamic MRI. However, most existing algorithms rely on large fully-sampled datasets for training, which are difficult to acquire. Recent… ▽ More Dynamic MRI plays a vital role in clinical practice by capturing both spatial details and dynamic motion, but its high spatiotemporal resolution is often limited by long scan times. Deep learning (DL)-based methods have shown promising performance in accelerating dynamic MRI. However, most existing algorithms rely on large fully-sampled datasets for training, which are difficult to acquire. Recently, implicit neural representation (INR) has emerged as a powerful scan-specific paradigm for accelerated MRI, which models signals as a continuous function over spatiotemporal coordinates. Although this approach achieves efficient continuous modeling of dynamic images and robust reconstruction, it faces challenges in recovering fine details and increasing computational demands for high dimensional data representation. To enhance both efficiency and reconstruction quality, we propose TenF-INR, a novel patch-based unsupervised framework that employs INR to model bases of tensor decomposition, enabling efficient and accurate modeling of dynamic MR images with learnable tensor functions. By exploiting strong correlations in similar spatial image patches and in the temporal direction, TenF-INR enforces multidimensional low-rankness and implements patch-based reconstruction with the benefits of continuous modeling. We compare TenF-INR with state-of-the-art methods, including supervised DL methods and unsupervised approaches. Experimental results demonstrate that TenF-INR achieves high acceleration factors up to 21, outperforming all comparison methods in image quality, temporal fidelity, and quantitative metrics, even surpassing the supervised methods. △ Less

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.21259 [pdf, other]

doi 10.1109/JIOT.2025.3574814

Stochastic Geometry-Based Performance Evaluation for LEO Satellite-Assisted Space Caching

Authors: Chunyi Ma, Jiajie Xu, Jianhua Yang, Mustafa A. Kishk

Abstract: To achieve the Internet of Things (IoT) vision,Mobile Edge Computing (MEC) is a promising technology aimed at providing low-latency computing services to user equipment (UE). However, terrestrial MEC network struggles to provide service to UEs in remote and maritime region. Low Earth Orbit (LEO) satellite networks have the potential to overcome geographical restrictions and provide seamless global… ▽ More To achieve the Internet of Things (IoT) vision,Mobile Edge Computing (MEC) is a promising technology aimed at providing low-latency computing services to user equipment (UE). However, terrestrial MEC network struggles to provide service to UEs in remote and maritime region. Low Earth Orbit (LEO) satellite networks have the potential to overcome geographical restrictions and provide seamless global coverage for UEs. In this paper, we provide the first attempt to use stochastic geometry to investigate the performance of implementing space caching with LEO satellites (SATs) in the MEC network. We study a LEO satellite-assisted space caching MEC network, and LEO SATs can be equipped with servers to enable space caching, with the advantage of seamless coverage to assist terrestrial CSs for serving UEs in remote or maritime reigon. Using stochastic geometry and queuing theory, we establish an analytical framework for this MEC network. Meanwhile, we develop association strategies for UEs to connect with LEO SATs or CSs and utilize stochastic geometry to derive uplink and downlink coverage probabilities, considering the diversity of task and service types. On this basis, we employ the queuing theory to calculate the average delay to evaluate the system performance. Through Monte Carlo simulations and numerical results, the system performance is evaluated. The results show the potential of SAT spatial caching in improving the performance of the MEC network. Additionally, our results reveal useful insights such as the significant impact of the altitude and number of LEO SATs on the average delay of the network, providing helpful system-level recommendations for the design and configuration of the space-caching MEC network. △ Less

Submitted 27 May, 2025; originally announced May 2025.

Comments: 15 pages, 12 figures, be accepted by IEEE IoTJ

arXiv:2505.18165 [pdf, ps, other]

A Comprehensive PPG-based Dataset for HR/HRV Studies

Authors: Jingye Xu, Yuntong Zhang, Wei Wang, Mimi Xie, Dakai Zhu

Abstract: Heart rate (HR) and heart rate variability (HRV) are important vital signs for human physical and mental health. Recent research has demonstrated that photoplethysmography (PPG) sensors can infer HR and HRV. However, it is difficult to find a comprehensive PPG-based dataset for HR/HRV studies, especially for various study needs: multiple scenes, long-term monitoring, and multimodality (multiple PP… ▽ More Heart rate (HR) and heart rate variability (HRV) are important vital signs for human physical and mental health. Recent research has demonstrated that photoplethysmography (PPG) sensors can infer HR and HRV. However, it is difficult to find a comprehensive PPG-based dataset for HR/HRV studies, especially for various study needs: multiple scenes, long-term monitoring, and multimodality (multiple PPG channels and extra acceleration data). In this study, we collected a comprehensive multimodal long-term dataset to address the gap of missing an all-in-one HR/HRV dataset (denoted as UTSA-PPG). We began by reviewing state-of-the-art datasets, emphasizing their strengths and limitations. Following this, we developed a custom data acquisition system and then collected the UTSA-PPG dataset and compared its key features with those of existing datasets. Additionally, five case studies were conducted, including comparisons with state-of-the-art datasets. The outcomes highlight the value of our dataset, demonstrating its utility for HR/HRV estimation exploration and its potential to aid researchers in creating generalized models for targeted research challenges. △ Less

Submitted 13 May, 2025; originally announced May 2025.

Comments: to be published in 13TH IEEE International Conference on Healthcare Informatics

arXiv:2505.17421 [pdf, ps, other]

Adaptive Implicit-Based Deep Learning Channel Estimation for 6G Communications

Authors: Zhen Qiao, Jiang Xue, Junkai Zhang, Guanzhang Liu, Xiaoqin Ma, Runhua Li, Faheem A. Khan, John S. Thompson, Zongben Xu

Abstract: With the widespread deployment of fifth-generation (5G) wireless networks, research on sixth-generation (6G) technology is gaining momentum. Artificial Intelligence (AI) is anticipated to play a significant role in 6G, particularly through integration with the physical layer for tasks such as channel estimation. Considering resource limitations in real systems, the AI algorithm should be designed… ▽ More With the widespread deployment of fifth-generation (5G) wireless networks, research on sixth-generation (6G) technology is gaining momentum. Artificial Intelligence (AI) is anticipated to play a significant role in 6G, particularly through integration with the physical layer for tasks such as channel estimation. Considering resource limitations in real systems, the AI algorithm should be designed to have the ability to balance the accuracy and resource consumption according to the scenarios dynamically. However, conventional explicit multilayer-stacked Deep Learning (DL) models struggle to adapt due to their heavy reliance on the structure of deep neural networks. This article proposes an adaptive Implicit-layer DL Channel Estimation Network (ICENet) with a lightweight framework for vehicle-to-everything communications. This novel approach balances computational complexity and channel estimation accuracy by dynamically adjusting computational resources based on input data conditions, such as channel quality. Unlike explicit multilayer-stacked DL-based channel estimation models, ICENet offers a flexible framework, where specific requirements can be achieved by adaptively changing the number of iterations of the iterative layer. Meanwhile, ICENet requires less memory while maintaining high performance. The article concludes by highlighting open research challenges and promising future research directions. △ Less

Submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.15279 [pdf, ps, other]

Robust Secure Communications in Near-Field ISCAP Systems with Extremely Large-Scale Antenna Array

Authors: Zixiang Ren, Siyao Zhang, Ling Qiu, Derrick Wing Kwan Ng, Jie Xu

Abstract: This paper investigates robust secure communications in a near-field integrated sensing, communication, and powering (ISCAP) system, in which the base station (BS) is equipped with an extremely large-scale antenna array (ELAA). In this system, the BS transmits confidential messages to a single legitimate communication user (CU), simultaneously providing wireless power transfer to multiple energy r… ▽ More This paper investigates robust secure communications in a near-field integrated sensing, communication, and powering (ISCAP) system, in which the base station (BS) is equipped with an extremely large-scale antenna array (ELAA). In this system, the BS transmits confidential messages to a single legitimate communication user (CU), simultaneously providing wireless power transfer to multiple energy receivers (ERs) and performing point target sensing. We consider a scenario in which both the ERs and the sensing target may act as potential eavesdroppers attempting to intercept the confidential messages. To safeguard secure communication, the BS employs a joint beamforming design by transmitting information beams combined with dedicated triple-purpose beams serving as energy and sensing signals, as well as artificial noise (AN) for effectively jamming potential eavesdroppers. It is assumed that only coarse location information of the ERs and sensing targets or eavesdroppers is available at the BS, leading to imperfect channel state information (CSI). Under this setup, we formulate a robust beamforming optimization problem with the objective of maximizing the secrecy rate for the CU, while ensuring worst-case performance requirements on both target sensing and wireless energy harvesting at the ERs. To address the non-convex robust joint beamforming problem and facilitate the deployment of a low-complexity algorithm, we employ the S-procedure alongside an eavesdropping CSI error-bound determination method to acquire a high-quality solution. △ Less

Submitted 21 May, 2025; originally announced May 2025.

Comments: 13 pages

arXiv:2505.10793 [pdf, ps, other]

SongEval: A Benchmark Dataset for Song Aesthetics Evaluation

Authors: Jixun Yao, Guobin Ma, Huixin Xue, Huakang Chen, Chunbo Hao, Yuepeng Jiang, Haohe Liu, Ruibin Yuan, Jin Xu, Wei Xue, Hao Liu, Lei Xie

Abstract: Aesthetics serve as an implicit and important criterion in song generation tasks that reflect human perception beyond objective metrics. However, evaluating the aesthetics of generated songs remains a fundamental challenge, as the appreciation of music is highly subjective. Existing evaluation metrics, such as embedding-based distances, are limited in reflecting the subjective and perceptual aspec… ▽ More Aesthetics serve as an implicit and important criterion in song generation tasks that reflect human perception beyond objective metrics. However, evaluating the aesthetics of generated songs remains a fundamental challenge, as the appreciation of music is highly subjective. Existing evaluation metrics, such as embedding-based distances, are limited in reflecting the subjective and perceptual aspects that define musical appeal. To address this issue, we introduce SongEval, the first open-source, large-scale benchmark dataset for evaluating the aesthetics of full-length songs. SongEval includes over 2,399 songs in full length, summing up to more than 140 hours, with aesthetic ratings from 16 professional annotators with musical backgrounds. Each song is evaluated across five key dimensions: overall coherence, memorability, naturalness of vocal breathing and phrasing, clarity of song structure, and overall musicality. The dataset covers both English and Chinese songs, spanning nine mainstream genres. Moreover, to assess the effectiveness of song aesthetic evaluation, we conduct experiments using SongEval to predict aesthetic scores and demonstrate better performance than existing objective evaluation metrics in predicting human-perceived musical quality. △ Less

Submitted 15 May, 2025; originally announced May 2025.

arXiv:2505.09558 [pdf, other]

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

Authors: Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao

Abstract: End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily due to the intelligent chatbots convey a wealth of non-textual information which cannot be easily measured using text-based language models like ChatGPT.… ▽ More End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily due to the intelligent chatbots convey a wealth of non-textual information which cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K includes both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement about Qwen2.5-Omni in objective accuracy from 55.1$\%$ to 91.5$\%$. In subjective A/B testing, WavReward also leads by a margin of 83$\%$. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be publicly at https://github.com/jishengpeng/WavReward after the paper is accepted. △ Less

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2505.09141 [pdf, ps, other]

Sensing-Assisted Channel Prediction in Complex Wireless Environments: An LLM-Based Approach

Authors: Junjie He, Zixiang Ren, Jianping Yao, Han Hu, Tony Xiao Han, Jie Xu

Abstract: This letter studies the sensing-assisted channel prediction for a multi-antenna orthogonal frequency division multiplexing (OFDM) system operating in realistic and complex wireless environments. In this system,an integrated sensing and communication (ISAC) transmitter leverages the mono-static sensing capability to facilitate the prediction of its bi-static communication channel, by exploiting the… ▽ More This letter studies the sensing-assisted channel prediction for a multi-antenna orthogonal frequency division multiplexing (OFDM) system operating in realistic and complex wireless environments. In this system,an integrated sensing and communication (ISAC) transmitter leverages the mono-static sensing capability to facilitate the prediction of its bi-static communication channel, by exploiting the fact that the sensing and communication channels share the same physical environment involving shared scatterers. Specifically, we propose a novel large language model (LLM)-based channel prediction approach,which adapts pre-trained text-based LLM to handle the complex-matrix-form channel state information (CSI) data. This approach utilizes the LLM's strong ability to capture the intricate spatiotemporal relationships between the multi-path sensing and communication channels, and thus efficiently predicts upcoming communication CSI based on historical communication and sensing CSI data. Experimental results show that the proposed LLM-based approach significantly outperforms conventional deep learning-based methods and the benchmark scheme without sensing assistance. △ Less

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2505.08414 [pdf]

An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care

Authors: Zhi Da Soh, Yang Bai, Kai Yu, Yang Zhou, Xiaofeng Lei, Sahil Thakur, Zann Lee, Lee Ching Linette Phang, Qingsheng Peng, Can Can Xue, Rachel Shujuan Chong, Quan V. Hoang, Lavanya Raghavan, Yih Chung Tham, Charumathi Sabanayagam, Wei-Chi Wu, Ming-Chih Ho, Jiangnan He, Preeti Gupta, Ecosse Lamoureux, Seang Mei Saw, Vinay Nangia, Songhomitra Panda-Jonas, Jie Xu, Ya Xing Wang , et al. (6 additional authors not shown)

Abstract: Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptati… ▽ More Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved $\ge$ 82.2% accuracy in disease detection, $\ge$ 89% in severity differentiation, $\ge$ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation. △ Less

Submitted 13 May, 2025; originally announced May 2025.

arXiv:2505.06330 [pdf, other]

Prompting Large Language Models for Training-Free Non-Intrusive Load Monitoring

Authors: Junyu Xue, Xudong Wang, Xiaoling He, Shicheng Liu, Yi Wang, Guoming Tang

Abstract: Non-intrusive load monitoring (NILM) aims to disaggregate aggregate household electricity consumption into individual appliance usage and thus enables more effective energy management. While deep learning has advanced NILM, it remains limited by its dependence on labeled data, restricted generalization, and lack of explainability. This paper introduces the first prompt-based NILM framework that le… ▽ More Non-intrusive load monitoring (NILM) aims to disaggregate aggregate household electricity consumption into individual appliance usage and thus enables more effective energy management. While deep learning has advanced NILM, it remains limited by its dependence on labeled data, restricted generalization, and lack of explainability. This paper introduces the first prompt-based NILM framework that leverages large language models (LLMs) with in-context learning. We design and evaluate prompt strategies that integrate appliance features, timestamps and contextual information, as well as representative time-series examples on widely used open datasets. With optimized prompts, LLMs achieve competitive state detection accuracy and demonstrate robust generalization without the need for fine-tuning. LLMs also enhance explainability by providing clear, human-readable explanations for their predictions. Our results show that LLMs can reduce data requirements, improve adaptability, and provide transparent energy disaggregation in NILM applications. △ Less

Submitted 20 May, 2025; v1 submitted 9 May, 2025; originally announced May 2025.

arXiv:2505.05159 [pdf, other]

FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech

Authors: Linhan Ma, Dake Guo, He Wang, Jin Xu, Lei Xie

Abstract: Current speech generation research can be categorized into two primary classes: non-autoregressive and autoregressive. The fundamental distinction between these approaches lies in the duration prediction strategy employed for predictable-length sequences. The NAR methods ensure stability in speech generation by explicitly and independently modeling the duration of each phonetic unit. Conversely, A… ▽ More Current speech generation research can be categorized into two primary classes: non-autoregressive and autoregressive. The fundamental distinction between these approaches lies in the duration prediction strategy employed for predictable-length sequences. The NAR methods ensure stability in speech generation by explicitly and independently modeling the duration of each phonetic unit. Conversely, AR methods employ an autoregressive paradigm to predict the compressed speech token by implicitly modeling duration with Markov properties. Although this approach improves prosody, it does not provide the structural guarantees necessary for stability. To simultaneously address the issues of stability and naturalness in speech generation, we propose FlexSpeech, a stable, controllable, and expressive TTS model. The motivation behind FlexSpeech is to incorporate Markov dependencies and preference optimization directly on the duration predictor to boost its naturalness while maintaining explicit modeling of the phonetic units to ensure stability. Specifically, we decompose the speech generation task into two components: an AR duration predictor and a NAR acoustic model. The acoustic model is trained on a substantial amount of data to learn to render audio more stably, given reference audio prosody and phone durations. The duration predictor is optimized in a lightweight manner for different stylistic variations, thereby enabling rapid style transfer while maintaining a decoupled relationship with the specified speaker timbre. Experimental results demonstrate that our approach achieves SOTA stability and naturalness in zero-shot TTS. More importantly, when transferring to a specific stylistic domain, we can accomplish lightweight optimization of the duration module solely with about 100 data samples, without the need to adjust the acoustic model, thereby enabling rapid and stable style transfer. △ Less

Submitted 15 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

Comments: 10 pages, 5 figures

arXiv:2505.02395 [pdf, other]

A Real-Time Control Barrier Function-Based Safety Filter for Motion Planning with Arbitrary Road Boundary Constraints

Authors: Jianye Xu, Chang Che, Bassam Alrifaee

Abstract: We present a real-time safety filter for motion planning, such as learning-based methods, using Control Barrier Functions (CBFs), which provides formal guarantees for collision avoidance with road boundaries. A key feature of our approach is its ability to directly incorporate road geometries of arbitrary shape without resorting to conservative overapproximations. We formulate the safety filter as… ▽ More We present a real-time safety filter for motion planning, such as learning-based methods, using Control Barrier Functions (CBFs), which provides formal guarantees for collision avoidance with road boundaries. A key feature of our approach is its ability to directly incorporate road geometries of arbitrary shape without resorting to conservative overapproximations. We formulate the safety filter as a constrained optimization problem in the form of a Quadratic Program (QP). It achieves safety by making minimal, necessary adjustments to the control actions issued by the nominal motion planner. We validate our safety filter through extensive numerical experiments across a variety of traffic scenarios featuring complex roads. The results confirm its reliable safety and high computational efficiency (execution frequency up to 40 Hz). Code & Video Demo: github.com/bassamlab/SigmaRL △ Less

Submitted 5 May, 2025; originally announced May 2025.

arXiv:2505.01742 [pdf, other]

Easz: An Agile Transformer-based Image Compression Framework for Resource-constrained IoTs

Authors: Yu Mao, Jingzong Li, Jun Wang, Hong Xu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

Abstract: Neural image compression, necessary in various machine-to-machine communication scenarios, suffers from its heavy encode-decode structures and inflexibility in switching between different compression levels. Consequently, it raises significant challenges in applying the neural image compression to edge devices that are developed for powerful servers with high computational and storage capacities.… ▽ More Neural image compression, necessary in various machine-to-machine communication scenarios, suffers from its heavy encode-decode structures and inflexibility in switching between different compression levels. Consequently, it raises significant challenges in applying the neural image compression to edge devices that are developed for powerful servers with high computational and storage capacities. We take a step to solve the challenges by proposing a new transformer-based edge-compute-free image coding framework called Easz. Easz shifts the computational overhead to the server, and hence avoids the heavy encoding and model switching overhead on the edge. Easz utilizes a patch-erase algorithm to selectively remove image contents using a conditional uniform-based sampler. The erased pixels are reconstructed on the receiver side through a transformer-based framework. To further reduce the computational overhead on the receiver, we then introduce a lightweight transformer-based reconstruction structure to reduce the reconstruction load on the receiver side. Extensive evaluations conducted on a real-world testbed demonstrate multiple advantages of Easz over existing compression approaches, in terms of adaptability to different compression levels, computational efficiency, and image reconstruction quality. △ Less

Submitted 14 May, 2025; v1 submitted 3 May, 2025; originally announced May 2025.

arXiv:2504.14894 [pdf, other]

Never too Cocky to Cooperate: An FIM and RL-based USV-AUV Collaborative System for Underwater Tasks in Extreme Sea Conditions

Authors: Jingzehua Xu, Guanwen Xie, Jiwei Tang, Yimian Ding, Weiyi Liu, Shuai Zhang, Yi Li

Abstract: This paper develops a novel unmanned surface vehicle (USV)-autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high-precision multi-AUV localization enabled by Fisher information matrix-optimized USV path planning, and (2) reinforcement learning-based cooperative planning and… ▽ More This paper develops a novel unmanned surface vehicle (USV)-autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high-precision multi-AUV localization enabled by Fisher information matrix-optimized USV path planning, and (2) reinforcement learning-based cooperative planning and control method for multi-AUV task execution. Extensive experimental evaluations in the underwater data collection task demonstrate the system's operational feasibility, with quantitative results showing significant performance improvements over baseline methods. The proposed system exhibits robust coordination capabilities between USV and AUVs while maintaining stability in extreme sea conditions. To facilitate reproducibility and community advancement, we provide an open-source simulation toolkit available at: https://github.com/360ZMEM/USV-AUV-colab . △ Less

Submitted 21 April, 2025; originally announced April 2025.

arXiv:2504.13570 [pdf, other]

Integrated Super-resolution Sensing and Symbiotic Communication with 3D Sparse MIMO for Low-Altitude UAV Swarm

Authors: Jingran Xu, Hongqi Min, Yong Zeng

Abstract: Low-altitude unmanned aerial vehicle (UAV) swarms are expected to play important role for future intelligent aerial systems due to their great potential to cooperatively accomplish complicated missions effectively. However, there are important challenges to be addressed to enable their efficient operation: the large-scale nature of swarms usually leads to excessive spectrum consumption, and ultra-… ▽ More Low-altitude unmanned aerial vehicle (UAV) swarms are expected to play important role for future intelligent aerial systems due to their great potential to cooperatively accomplish complicated missions effectively. However, there are important challenges to be addressed to enable their efficient operation: the large-scale nature of swarms usually leads to excessive spectrum consumption, and ultra-low cost requirements for individual UAVs renders it necessary to develop more cost-effective communication modules. In addition, the densely located swarm UAVs require high resolution for localization and sensing. To address the above challenges and simultaneously achieve spectrum and energy-efficient communication and accurate sensing, we investigate low-altitude UAV swarm with integrated super-resolution sensing and symbiotic communication technology. Specifically, one leading UAV may act as a primary transmitter (PT) to transmit communication signals to the base station (BS), and the remaining nearby UAVs in the swarm act as passive backscatter devices (BDs), which can modulate their information by efficiently backscattering the radio frequency (RF) signals from the PT without consuming extra spectrum or power. In addition, to achieve efficient three-dimensional (3D) super-resolution sensing for the densely located UAV swarm, 3D sparse multiple-input multiple-output (MIMO) technology and super-resolution signal processing algorithms are further exploited, where both L-shaped nested array (LNA) and planar nested arrays (PNA) are considered at the BS. To evaluate the communication and sensing performance for the UAV-symbiotic radio (SR) system, the achievable rates of UAV swarm are derived and the beam patterns of sparse LNA, PNA and the benchmarking compact uniform planar array (UPA) are compared. △ Less

Submitted 18 April, 2025; originally announced April 2025.

Comments: 13 pages, 15 figures

arXiv:2504.13131 [pdf, other]

NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

Authors: Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Yabin Zhang, Ao-Xiang Zhang, Tianwu Zhi, Jianzhao Liu, Yang Li, Jingwen Xu, Yiting Liao, Yushen Zuo, Mingyang Wu, Renjie Li, Shengyun Zhong , et al. (88 additional authors not shown)

Abstract: This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating re… ▽ More This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image superresolution. The project is publicly available at https://github.com/lixinustc/KVQE- ChallengeCVPR-NTIRE2025. △ Less

Submitted 17 April, 2025; originally announced April 2025.

Comments: Challenge Report of NTIRE 2025; Methods from 18 Teams; Accepted by CVPR Workshop; 21 pages

arXiv:2504.09441 [pdf, other]

Structure-Accurate Medical Image Translation via Dynamic Frequency Balance and Knowledge Guidance

Authors: Jiahua Xu, Dawei Zhou, Lei Hu, Zaiyi Liu, Nannan Wang, Xinbo Gao

Abstract: Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel… ▽ More Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority. △ Less

Submitted 27 May, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

Comments: Medical image translation, Diffusion model, 16 pages

arXiv:2504.07498 [pdf, other]

Learning Joint Source-Channel Encoding in IRS-assisted Multi-User Semantic Communications

Authors: Haidong Wang, Songhan Zhao, Lanhua Li, Bo Gu, Jing Xu, Shimin Gong, Jiawen Kang

Abstract: In this paper, we investigate a joint source-channel encoding (JSCE) scheme in an intelligent reflecting surface (IRS)-assisted multi-user semantic communication system. Semantic encoding not only compresses redundant information, but also enhances information orthogonality in a semantic feature space. Meanwhile, the IRS can adjust the spatial orthogonality, enabling concurrent multi-user semantic… ▽ More In this paper, we investigate a joint source-channel encoding (JSCE) scheme in an intelligent reflecting surface (IRS)-assisted multi-user semantic communication system. Semantic encoding not only compresses redundant information, but also enhances information orthogonality in a semantic feature space. Meanwhile, the IRS can adjust the spatial orthogonality, enabling concurrent multi-user semantic communication in densely deployed wireless networks to improve spectrum efficiency. We aim to maximize the users' semantic throughput by jointly optimizing the users' scheduling, the IRS's passive beamforming, and the semantic encoding strategies. To tackle this non-convex problem, we propose an explainable deep neural network-driven deep reinforcement learning (XD-DRL) framework. Specifically, we employ a deep neural network (DNN) to serve as a joint source-channel semantic encoder, enabling transmitters to extract semantic features from raw images. By leveraging structural similarity, we assign some DNN weight coefficients as the IRS's phase shifts, allowing simultaneous optimization of IRS's passive beamforming and DNN training. Given the IRS's passive beamforming and semantic encoding strategies, user scheduling is optimized using the DRL method. Numerical results validate that our JSCE scheme achieves superior semantic throughput compared to the conventional schemes and efficiently reduces the semantic encoder's mode size in multi-user scenarios. △ Less

Submitted 10 April, 2025; originally announced April 2025.

arXiv:2504.06830 [pdf, other]

Integrated Sensing and Communications Over the Years: An Evolution Perspective

Authors: Di Zhang, Yuanhao Cui, Xiaowen Cao, Nanchi Su, Fan Liu, Xiaojun Jing, J. Andrew Zhang, Jie Xu, Christos Masouros, Dusit Niyato, Marco Di Renzo

Abstract: Integrated Sensing and Communications (ISAC) enables efficient spectrum utilization and reduces hardware costs for beyond 5G (B5G) and 6G networks, facilitating intelligent applications that require both high-performance communication and precise sensing capabilities. This survey provides a comprehensive review of the evolution of ISAC over the years. We examine the expansion of the spectrum acros… ▽ More Integrated Sensing and Communications (ISAC) enables efficient spectrum utilization and reduces hardware costs for beyond 5G (B5G) and 6G networks, facilitating intelligent applications that require both high-performance communication and precise sensing capabilities. This survey provides a comprehensive review of the evolution of ISAC over the years. We examine the expansion of the spectrum across RF and optical ISAC, highlighting the role of advanced technologies, along with key challenges and synergies. We further discuss the advancements in network architecture from single-cell to multi-cell systems, emphasizing the integration of collaborative sensing and interference mitigation strategies. Moreover, we analyze the progress from single-modal to multi-modal sensing, with a focus on the integration of edge intelligence to enable real-time data processing, reduce latency, and enhance decision-making. Finally, we extensively review standardization efforts by 3GPP, IEEE, and ITU, examining the transition of ISAC-related technologies and their implications for the deployment of 6G networks. △ Less

Submitted 9 April, 2025; originally announced April 2025.

arXiv:2503.20215 [pdf, other]

Qwen2.5-Omni Technical Report

Authors: Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin

Abstract: In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timest… ▽ More In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE(Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose \textbf{Thinker-Talker} architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness. △ Less

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.18074 [pdf, other]

WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression

Authors: Yu Mao, Jun Wang, Nan Guan, Chun Jason Xue

Abstract: Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Inter… ▽ More Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress the WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we developed a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, reducing the entropy of the image and then adopting a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress the gigapixel WSI images to 36 times on average and up to 136 times. △ Less

Submitted 23 March, 2025; originally announced March 2025.

Showing 1–50 of 559 results for author: Xu, J