Search | arXiv e-print repository

SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

Authors: Xingchen Li, Hanke Xie, Ziqian Wang, Zihan Zhang, Longshuai Xiao, Lei Xie

Abstract: Generative universal speech enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. Diffusion- or flow-based generative models are capable of producing enhanced speech with high quality and fidelity. However, they typically achieve speech enhancement by learning an acoustic feature mapping from degraded speech to clean speech, while lacking awareness of high-level semantic information. This deficiency tends to cause semantic ambiguity and acoustic discontinuities in the enhanced speech. In contrast, humans can often comprehend heavily corrupted speech by relying on semantic priors, suggesting that semantics play a crucial role in speech enhancement. Therefore, in this paper, we propose SenSE, which leverages a language model to capture the semantic information of distorted speech and effectively integrates it into a flow-matching-based speech enhancement framework. Specifically, we introduce a semantic-aware speech language model to capture the semantics of degraded speech and generate semantic tokens. We then design a semantic guidance mechanism that incorporates semantic information into the flow-matching-based speech enhancement process, effectively mitigating semantic ambiguity. In addition, we propose a prompt guidance mechanism, which leverages a short reference utterance to alleviate the loss of speaker similarity under severe distortion conditions. Results on several benchmark datasets demonstrate that SenSE not only ensures high perceptual quality but also substantially improves speech fidelity while maintaining strong robustness under severe distortions. Code and demos are available.

Submitted 29 September, 2025; originally announced September 2025.

Comments: Under review

arXiv:2509.24524 [pdf, ps, other]

PhysiAgent: An Embodied Agent Framework in Physical World

Authors: Zhihao Wang, Jianxiong Li, Jinliang Zheng, Wencong Zhang, Dongxiu Liu, Yinan Zheng, Haoyi Niu, Junzhi Yu, Xianyuan Zhan

Abstract: Vision-Language-Action (VLA) models have achieved notable success but often struggle with limited generalization. To address this, integrating generalized Vision-Language Models (VLMs) as assistants to VLAs has emerged as a popular solution. However, current approaches often combine these models in rigid, sequential structures: using VLMs primarily for high-level scene understanding and task planning, and VLAs merely as executors of lower-level actions, leading to ineffective collaboration and poor grounding. In this paper, we propose an embodied agent framework, PhysiAgent, tailored to operate effectively in physical environments. By incorporating monitor, memory, self-reflection mechanisms, and lightweight off-the-shelf toolboxes, PhysiAgent offers an autonomous scaffolding framework to prompt VLMs to organize different components based on real-time proficiency feedback from VLAs to maximally exploit VLAs' capabilities. Experimental results demonstrate significant improvements in task-solving performance on complex real-world robotic tasks, showcasing effective self-regulation of VLMs, coherent tool collaboration, and adaptive evolution of the framework during execution. PhysiAgent makes practical and pioneering efforts to integrate VLMs and VLAs, effectively grounding embodied agent frameworks in real-world settings.

Submitted 29 September, 2025; originally announced September 2025.

arXiv:2509.23299 [pdf, ps, other]

MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow

Authors: Yike Zhu, Boyi Kang, Ziqian Wang, Xingchen Li, Zihan Zhang, Wenjie Li, Longshuai Xiao, Wei Xue, Lei Xie

Abstract: Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion/flow-matching) or large language models, limiting real-time deployment. To mitigate these constraints, we present MeanFlowSE, a one-step generative SE framework. It adopts MeanFlow to predict an average-velocity field for one-step latent refinement and conditions the model on self-supervised learning (SSL) representations rather than VAE latents. This design accelerates inference and provides robust acoustic-semantic guidance during training. On the Interspeech 2020 DNS Challenge blind test set and simulated test set, MeanFlowSE attains state-of-the-art (SOTA) level perceptual quality and competitive intelligibility while significantly lowering both real-time factor (RTF) and model size compared with recent generative competitors, making it suitable for practical use. The code will be released upon publication at https://github.com/Hello3orld/MeanFlowSE.

Submitted 30 September, 2025; v1 submitted 27 September, 2025; originally announced September 2025.

Comments: Submitted to ICASSP 2026

arXiv:2509.21290 [pdf, ps, other]

Vision-Intelligence-Enabled Beam Tracking for Cross-Interface Water-Air Optical Wireless Communications

Authors: Tianqi Mao, Jiayue Liu, Weijie Liu, Dezhi Zheng, Zhaocheng Wang

Abstract: The escalating development of oceanic applications, such as underwater surveillance and mineral exploration, is motivating real-time wireless backhaul of the considerable observation data. Such prospects can hardly be realized by the narrowband acoustic approach. Alternatively, optical wireless communication (OWC) has emerged as a promising solution for maritime and underwater applications due to its great potential for broadband underwater transmission. However, the implementation of water-air OWC can be rather challenging, especially when penetrating the fluctuating interface, where the direction of refracted signals changes dynamically, causing severe beam misalignment with airborne stations. This has necessitated real-time transceiver alignment adaptable to the sophisticated oceanic environment, which has yet to be addressed. Against this background, this paper establishes a mathematical channel model for water-air optical wireless transmission across the fluctuating sea surface. Based on the model, we propose a vision-based beam tracking algorithm that leverages artificial intelligence (AI) methods for dynamic channel prediction. The proposed algorithm integrates a convolutional neural network (CNN) with bi-directional long short-term memory (Bi-LSTM), which further incorporates the attention mechanism to effectively extract critical spatio-temporal features from the vision data. The numerical simulation results show that the proposed algorithm can outperform its classical counterparts in maintaining received signal strength and suppressing the vision noise, which demonstrates its robustness against the harsh conditions of water-air OWC systems.

Submitted 25 September, 2025; originally announced September 2025.

arXiv:2509.21118 [pdf, ps, other]

Neural Integrated Sensing and Communication for the MIMO-OFDM Downlink

Authors: Ziyi Wang, Frederik Zumegen, Christoph Studer

Abstract: The ongoing convergence of spectrum and hardware requirements for wireless sensing and communication applications has fueled the integrated sensing and communication (ISAC) paradigm in next-generation networks. Neural-network-based ISAC leverages data-driven learning techniques to add sensing capabilities to existing communication infrastructure. This paper presents a novel signal-processing framework for such neural ISAC systems based on the multiple-input multiple-output (MIMO) and orthogonal frequency-division multiplexing (OFDM) downlink. Our approach enables generalized sensing functionality without modifying the MIMO-OFDM communication link. Specifically, our neural ISAC pipeline measures the backscattered communication signals to generate discrete map representations of spatial occupancy, formulated as multiclass or multilabel classification problems, which can then be utilized by specialized downstream tasks. To improve sensing performance in closed or cluttered environments, our neural ISAC pipeline relies on features specifically designed to mitigate strong reflective paths. Extensive simulations using ray-tracing models demonstrate that our neural ISAC framework reliably reconstructs scene maps without altering the MIMO-OFDM communication pipeline or reducing data rates.

Submitted 25 September, 2025; originally announced September 2025.

Comments: To appear in the IEEE Journal on Selected Areas in Communications

arXiv:2509.20030 [pdf, ps, other]

Multi-Stage CD-Kennedy Receiver for QPSK Modulated CV-QKD in Turbulent Channels

Authors: Renzhi Yuan, Zhixing Wang, Shouye Miao, Mufei Zhao, Haifeng Yao, Bin Cao, Mugen Peng

Abstract: Continuous variable-quantum key distribution (CV-QKD) protocols have attracted increasing attention in recent years because they enjoy a high secret key rate (SKR) and good compatibility with existing optical communication infrastructure. Classical coherent receivers are widely employed in coherent-state-based CV-QKD protocols, whose detection performance is bounded by the standard quantum limit (SQL). Recently, quantum receivers based on displacement operators have been experimentally demonstrated with detection performance surpassing the SQL in various practical conditions. However, potential applications of quantum receivers in CV-QKD protocols under turbulent channels are still not well explored, while practical CV-QKD protocols must survive the atmospheric turbulence in satellite-to-ground optical communication links. In this paper, we consider the possibility of using a quantum receiver called the multi-stage CD-Kennedy receiver to enhance the SKR performance of a quadrature phase shift keying (QPSK) modulated CV-QKD protocol in turbulent channels. We first derive the error probability of the multi-stage CD-Kennedy receiver for detecting QPSK signals in turbulent channels and further propose three types of multi-stage CD-Kennedy receiver with different displacement choices, i.e., the Type-I, Type-II, and Type-III receivers. Then we derive the SKR of a QPSK modulated CV-QKD protocol using the multi-stage CD-Kennedy receiver and post-selection strategy in turbulent channels. Numerical results show that the multi-stage CD-Kennedy receiver can outperform the classical coherent receiver in turbulent channels in terms of both error probability and SKR performance, and that the Type-II receiver can tolerate worse channel conditions than the Type-I and Type-III receivers in terms of error probability performance.

Submitted 24 September, 2025; originally announced September 2025.

Comments: 25 pages, 7 figures

arXiv:2509.19754 [pdf, ps, other]

Timeliness-Aware Joint Source and Channel Coding for Adaptive Image Transmission

Authors: Xiaolei Yang, Zijing Wang, Zhijin Qin, Xiaoming Tao

Abstract: Accurate and timely image transmission is critical for emerging time-sensitive applications such as remote sensing in satellite-assisted Internet of Things. However, the bandwidth limitation poses a significant challenge in existing wireless systems, making it difficult to fulfill the requirements of both high-fidelity and low-latency image transmission. Semantic communication is expected to break through the performance bottleneck by focusing on the transmission of goal-oriented semantic information rather than raw data. In this paper, we employ a new timeliness metric named the value of information (VoI) and propose an adaptive joint source and channel coding (JSCC) method for image transmission that simultaneously considers both reconstruction quality and timeliness. Specifically, we first design a JSCC framework for image transmission with adaptive code length. Next, we formulate a VoI maximization problem by optimizing the transmission code length of the adaptive JSCC under the reconstruction quality constraint. Then, a deep reinforcement learning-based algorithm is proposed to solve the optimization problem efficiently. Experimental results show that the proposed method significantly outperforms baseline schemes in terms of reconstruction quality and timeliness, particularly in low signal-to-noise ratio conditions, offering a promising solution for efficient and robust image transmission in time-sensitive wireless networks.

Submitted 24 September, 2025; originally announced September 2025.

Comments: 6 pages, 7 figures, accepted at IEEE GLOBECOM Workshops 2025

arXiv:2509.18592 [pdf, ps, other]

VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation

Authors: Neel P. Bhatt, Yunhao Yang, Rohan Siva, Pranay Samineni, Daniel Milan, Zhangyang Wang, Ufuk Topcu

Abstract: Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/.

Submitted 22 September, 2025; originally announced September 2025.

Comments: Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/

arXiv:2509.18555 [pdf, ps, other]

A Secure Affine Frequency Division Multiplexing for Wireless Communication Systems

Authors: Ping Wang, Zulin Wang, Yuanfang Ma, Xiaosi Tian, Yuanhan Ni

Abstract: This paper introduces a secure affine frequency division multiplexing (SE-AFDM) waveform for wireless communication systems to enhance communication security. Besides configuring the parameter c1 to obtain communication reliability under doubly selective channels, we also utilize the time-varying parameter c2 to improve the security of the communication system. The derived input-output relation shows that the legitimate receiver can eliminate the nonlinear impact introduced by the time-varying c2 without losing bit error rate (BER) performance. Moreover, it is theoretically proved that the eavesdropper cannot separate the time-varying c2 and random information symbols, such that the BER performance of the eavesdropper is severely deteriorated. Meanwhile, the analysis of the effective signal-to-interference-plus-noise ratio (SINR) of the eavesdropper illustrates that the SINR decreases as the value range of c2 expands. Numerical results verify that the proposed SE-AFDM waveform provides significant security while maintaining good BER performance in high-mobility scenarios.

Submitted 22 September, 2025; originally announced September 2025.

Comments: 6 pages, 5 figures, 2025 IEEE International Conference on Communications

arXiv:2509.17046 [pdf, ps, other]

A Chain-of-thought Reasoning Breast Ultrasound Dataset Covering All Histopathology Categories

Authors: Haojun Yu, Youcheng Li, Zihan Niu, Nan Zhang, Xuantong Gong, Huan Li, Zhiying Zou, Haifeng Qi, Zhenxiao Cao, Zijie Lan, Xingjian Yuan, Jiating He, Haokai Zhang, Shengtao Zhang, Zicheng Wang, Dong Wang, Ziwei Zhao, Congying Chen, Yong Wang, Wangyan Qin, Qingli Zhu, Liwei Wang

Abstract: Breast ultrasound (BUS) is an essential tool for diagnosing breast lesions, with millions of examinations per year. However, publicly available high-quality BUS benchmarks for AI development are limited in data scale and annotation richness. In this work, we present BUS-CoT, a BUS dataset for chain-of-thought (CoT) reasoning analysis, which contains 11,439 images of 10,019 lesions from 4,838 patients and covers all 99 histopathology types. To facilitate research on incentivizing CoT reasoning, we construct the reasoning processes based on observation, feature, diagnosis and pathology labels, annotated and verified by experienced experts. Moreover, by covering lesions of all histopathology types, we aim to facilitate robust AI systems in rare cases, which can be error-prone in clinical practice.

Submitted 22 September, 2025; v1 submitted 21 September, 2025; originally announced September 2025.

arXiv:2509.16963 [pdf]

A Reliable Robot Motion Planner in Complex Real-world Environments via Action Imagination

Authors: Chengjin Wang, Yanmin Zhou, Zhipeng Wang, Zheng Yan, Feng Luan, Shuo Jiang, Runjie Shen, Hongrui Sang, Bin He

Abstract: Humans and animals can make real-time adjustments to movements by imagining their action outcomes to prevent unanticipated or even catastrophic motion failures in unknown unstructured environments. Action imagination, as a refined sensorimotor strategy, leverages perception-action loops to handle physical interaction-induced uncertainties in perception and system modeling within complex systems. Inspired by the action-awareness capability of animal intelligence, this study proposes an imagination-inspired motion planner (I-MP) framework that specifically enhances robots' action reliability by imagining plausible spatial states for approaching. After topologizing the workspace, I-MP builds a perception-action loop that enables robots to autonomously build contact models. Leveraging fixed-point theory and Hausdorff distance, the planner computes convergent spatial states under interaction characteristics and mission constraints. By homogeneously representing multi-dimensional environmental characteristics through work, the robot can approach the imagined spatial states via real-time computation of energy gradients. Experimental results demonstrate the practicality and robustness of I-MP in complex cluttered environments.

Submitted 21 September, 2025; originally announced September 2025.

arXiv:2509.15162 [pdf, ps, other]

A Unified Distributed Algorithm for Hybrid Near-Far Field Activity Detection in Cell-Free Massive MIMO

Authors: Jingreng Lei, Yang Li, Ziyue Wang, Qingfeng Lin, Ya-Feng Liu, Yik-Chung Wu

Abstract: Considerable effort has recently been devoted to activity detection for massive machine-type communications in cell-free multiple-input multiple-output (MIMO) systems. However, as the number of antennas at the access points (APs) increases, the Rayleigh distance that separates the near-field and far-field regions also expands, rendering the conventional assumption of far-field propagation alone impractical. To address this challenge, this paper establishes a covariance-based formulation that can effectively capture the statistical property of hybrid near-far field channels. Based on this formulation, we theoretically reveal that increasing the proportion of near-field channels enhances the detection performance. Furthermore, we propose a distributed algorithm, where each AP performs local activity detection and only exchanges the detection results with the central processing unit, thus significantly reducing the computational complexity and the communication overhead. In addition to having a convergence guarantee, the proposed algorithm is unified in the sense that it can handle single-cell or cell-free systems with either near-field or far-field devices as special cases. Simulation results validate the theoretical analyses and demonstrate the superior performance of the proposed approach compared with existing methods.
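
To illustrate the abstract's remark that the Rayleigh distance expands with the number of antennas, here is a minimal sketch using the standard far-field boundary 2D^2/lambda. The half-wavelength uniform linear array, the 30 GHz carrier, and the function name `rayleigh_distance` are assumptions chosen for illustration; the abstract does not specify the array geometry or the carrier frequency.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def rayleigh_distance(num_antennas: int, carrier_hz: float) -> float:
    """Rayleigh distance 2*D^2/lambda for an assumed half-wavelength-spaced linear array."""
    wavelength = SPEED_OF_LIGHT / carrier_hz
    # Aperture D of a uniform linear array with (num_antennas - 1) gaps of lambda/2.
    aperture = (num_antennas - 1) * wavelength / 2.0
    return 2.0 * aperture ** 2 / wavelength


if __name__ == "__main__":
    # Assumed carrier of 30 GHz; the boundary grows roughly with the square of the array size.
    for n in (16, 64, 256):
        print(n, "antennas:", round(rayleigh_distance(n, 30e9), 2), "m")
```

Under these assumptions the boundary scales as (N-1)^2 * lambda / 2, so quadrupling the antenna count pushes it out by roughly a factor of sixteen, which is why a far-field-only model becomes harder to justify for large arrays.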

Submitted 18 September, 2025; originally announced September 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2509.13674 [pdf]

Scaling green hydrogen and CCUS via cement-methanol co-production in China

Authors: Yuezhang He, Hongxi Luo, Yuancheng Lin, Carl J. Talsma, Anna Li, Zhenqian Wang, Yujuan Fang, Pei Liu, Jesse D. Jenkins, Eric Larson, Zheng Li

Abstract: High costs of green hydrogen and of carbon capture, utilization, and sequestration (CCUS) have hindered policy ambition and slowed real-world deployment, despite their importance for decarbonizing hard-to-abate sectors, including cement and methanol. Given the economic challenges of adopting CCUS in cement and green hydrogen in methanol production separately, we propose a renewable-powered co-production system that couples electrolytic hydrogen and CCUS through molecule exchange. We optimize system configurations using an hourly-resolved, process-based model incorporating operational flexibility, and explore integrated strategies for plant-level deployment and CO2 source-sink matching across China. We find that co-production could reduce CO2 abatement costs to USD 41-53 per tonne by 2035, significantly lower than approximately USD 75 for standalone cement CCUS and over USD 120 for standalone renewable-based methanol. Co-production is preferentially deployed at cement plants in renewable-rich regions, potentially reshaping national CO2 infrastructure planning. This hydrogen-CCUS coupling paradigm could accelerate industrial decarbonization and scaling for other applications.

Submitted 17 September, 2025; originally announced September 2025.

arXiv:2509.13658 [pdf, ps, other]

Assessing Data Replication in Symbolic Music via Adapted Structural Similarity Index Measure

Authors: Shulei Ji, Zihao Wang, Le Ma, Jiaxing Yu, Kejun Zhang

Abstract: AI-generated music may inadvertently replicate samples from the training data, raising concerns of plagiarism. Similarity measures can quantify such replication, thereby offering supervision and guidance for music generation models. Existing similarity measures for symbolic music mainly target melody repetition, leaving a gap in assessing complex music with rich textures and expressive performance characteristics. To address this gap, we introduce SSIMuse, the first adaptation of the Structural Similarity Index Measure (SSIM) from images to symbolic music. Specifically, we represent symbolic music as image-like piano rolls in binary and velocity-based forms. Building upon these representations, we reinterpret and suitably modify the SSIM components in the musical context to develop two variants, i.e., SSIMuse-B and SSIMuse-V, for evaluating data replication in composition and dynamic performance, respectively. Controlled experiments on synthetic samples from multiple datasets show that SSIMuse can reliably detect exact replication at a granularity of at least one bar. SSIMuse enables open evaluation of replication in music generation and draws attention to its broader ethical, social, legal, and economic implications. The code is available at https://github.com/Tayjsl97/SSIMuse.

Submitted 16 September, 2025; originally announced September 2025.

arXiv:2509.12748 [pdf, ps, other]

NEFT: A Unified Transformer Framework for Efficient Near-Field CSI Feedback in XL-MIMO Systems

Authors: Haiyang Li, Tianqi Mao, Pengyu Wang, Ruiqi Liu, Shunyu Li, Zhaocheng Wang

Abstract: Extremely large-scale multiple-input multiple-output (XL-MIMO) systems, operating in the near-field region due to their massive antenna arrays, are a key enabler of next-generation wireless communications but face significant challenges in channel state information (CSI) feedback. Deep learning has emerged as a powerful tool by learning compact CSI representations for feedback. However, existing methods struggle to capture the intricate structure of near-field CSI while incurring prohibitive computational overhead on practical mobile devices. To overcome these limitations, we propose the Near-Field Efficient Feedback Transformer (NEFT) family for accurate and efficient near-field CSI feedback across diverse hardware platforms. Built on a hierarchical Vision Transformer backbone, NEFT is extended with lightweight variants to meet various deployment constraints: NEFT-Compact applies multi-level knowledge distillation (KD) to reduce complexity while maintaining accuracy, and NEFT-Hybrid and NEFT-Edge address encoder- and edge-constrained scenarios via attention-free encoding and KD. Extensive simulations show that NEFT achieves a 15-21 dB improvement in normalized mean-squared error (NMSE) over state-of-the-art methods, while NEFT-Compact and NEFT-Edge reduce total FLOPs by 25-36% with negligible accuracy loss. Moreover, NEFT-Hybrid lowers encoder-side complexity by up to 64%, enabling deployment in highly asymmetric device scenarios. These results establish NEFT as a practical and scalable solution for near-field CSI feedback in XL-MIMO systems.
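
For readers unfamiliar with the NMSE figures quoted in the abstract, the following is a minimal sketch of the usual definition, the squared reconstruction error normalized by the energy of the true channel and expressed in decibels. The function name `nmse_db` and the toy channel vectors are hypothetical, and the paper may average over many channel realizations rather than evaluate a single one as done here.

```python
import math


def nmse_db(h_true, h_est):
    """Normalized mean-squared error of one channel estimate, in dB (assumed single-sample form)."""
    error_energy = sum(abs(a - b) ** 2 for a, b in zip(h_true, h_est))
    signal_energy = sum(abs(a) ** 2 for a in h_true)
    if signal_energy == 0:
        raise ValueError("true channel has zero energy")
    if error_energy == 0:
        return float("-inf")  # perfect reconstruction
    return 10.0 * math.log10(error_energy / signal_energy)


if __name__ == "__main__":
    # Hypothetical two-tap complex channel and two reconstructions of different quality.
    h = [1 + 0j, 0 + 1j]
    coarse = [0j, 0j]
    fine = [0.9 + 0j, 0 + 0.9j]
    print("coarse:", nmse_db(h, coarse), "dB")
    print("fine:  ", nmse_db(h, fine), "dB")
```

On this scale a 20 dB improvement corresponds to a hundredfold reduction in normalized error energy, which gives a sense of the size of the 15-21 dB gain reported above.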

Submitted 16 September, 2025; originally announced September 2025.

arXiv:2509.11516 [pdf]

PaiP: An Operational Aware Interactive Planner for Unknown Cabinet Environments

Authors: Chengjin Wang, Zheng Yan, Yanmin Zhou, Runjie Shen, Zhipeng Wang, Bin Cheng, Bin He

Abstract: Box/cabinet scenarios with stacked objects pose significant challenges for robotic motion due to visual occlusions and constrained free space. Traditional collision-free trajectory planning methods often fail when no collision-free paths exist, and may even lead to catastrophic collisions caused by invisible objects. To overcome these challenges, we propose an operational aware interactive motion planner (PaiP), a real-time closed-loop planning framework utilizing multimodal tactile perception. This framework autonomously infers object interaction features by perceiving motion effects at interaction interfaces. These interaction features are incorporated into grid maps to generate operational cost maps. Building upon this representation, we extend sampling-based planning methods to interactive planning by optimizing both path cost and operational cost. Experimental results demonstrate that PaiP achieves robust motion in narrow spaces.

Submitted 14 September, 2025; originally announced September 2025.

arXiv:2509.10666 [pdf, ps, other]

Uplink and Downlink Communications in Segmented Waveguide-Enabled Pinching-Antenna Systems (SWANs)

Authors: Chongjun Ouyang, Hao Jiang, Zhaolin Wang, Yuanwei Liu, Zhiguo Ding

Abstract: A segmented waveguide-enabled pinching-antenna system (SWAN) is proposed, in which a segmented waveguide composed of multiple short dielectric waveguide segments is employed to radiate or receive signals through the pinching antennas (PAs) deployed on each segment. Based on this architecture, three practical operating protocols are proposed: segment selection (SS), segment aggregation (SA), and segment multiplexing (SM). For uplink SWAN communications, where one PA is activated per segment, the segmented structure eliminates the inter-antenna radiation effect, i.e., signals captured by one PA may re-radiate through other PAs along the same waveguide. This yields a tractable and physically consistent uplink signal model for a multi-PA pinching-antenna system (PASS), which has not been established for conventional PASS using a single long waveguide. Building on this model, PA placement algorithms are proposed to maximize the uplink signal-to-noise ratio (SNR). Closed-form expressions for the received SNR under the three protocols are derived, and the corresponding scaling laws with respect to the number of segments are analyzed. It is proven that the segmented architecture reduces both the average PA-to-user distance and the PA-to-feed distance, thereby mitigating both large-scale path loss and in-waveguide propagation loss. These results are extended to downlink SWAN communications, where multiple PAs are activated per segment, and PA placement methods are proposed to maximize the downlink received SNR under the three protocols. Numerical results demonstrate that: i) among the three protocols, SM achieves the best performance, followed by SA and then SS; and ii) for all protocols, the proposed SWAN achieves a higher SNR than conventional PASS with a single long waveguide in both uplink and downlink scenarios.

Submitted 12 September, 2025; originally announced September 2025.

Comments: Submitted to IEEE journal

arXiv:2509.10296 [pdf, ps, other]

Low-Complexity Null-Space-Based Simultaneous Wireless Information and Power Transfer Scheme

Authors: Cheng Luo, Jie Hu, Luping Xiang, Kun Yang, Zhiqin Wang

Abstract: Simultaneous wireless information and power transfer (SWIPT) has attracted sustained interest. We propose a null-space-based transmission scheme for multiuser SWIPT serving both energy users (EUs) and information users (IUs). Under a practical nonlinear energy-harvesting (EH) model and multiple waveform options, we revisit the role of dedicated energy beams (EBs). We show that, in general, dedicated EBs are unnecessary because information beams (IBs) with Gaussian signaling can simultaneously support wireless energy transfer (WET) and wireless information transfer (WIT), unless special energy-centric waveforms (e.g., deterministic sinusoidal waveforms) are employed and provide sufficient gains. Guided by these insights, we formulate an optimization problem for EB design to enable dedicated waveform transmission for WET, and we develop a low-complexity algorithm that reduces computation by ignoring the WET contribution of IBs during optimization. Numerical results corroborate that deterministic sinusoidal waveforms outperform Gaussian signaling when the received RF power lies in the EH high-efficiency region, making dedicated EBs beneficial. The proposed scheme achieves computational complexity reductions of 91.43% and 98.54% for the cases $M=8,\,K^I=K^E=2$ and $M=16,\,K^I=K^E=4$, respectively, with negligible performance loss, thereby validating the efficiency of the low-complexity algorithm.

Submitted 12 September, 2025; originally announced September 2025.

arXiv:2509.06425 [pdf]

First-Principle Modeling Framework of Boost Converter Dynamics for Precise Energy Conversions in Space

Authors: Yifan Wang, Wenhua Li, Zhenlong Wang, Xinrui Zhang, Jianfeng Sun, Qianfu Xia, Zhongtao Gou, Jiangang Rong, Tao Ye

Abstract: Boost converters are essential for modern electrification and intelligent technologies. However, conventional Boost converter models relying on steady-state assumptions fail to accurately predict transient behaviors during input voltage and load fluctuations, which cause significant output voltage overshoots and instability, resulting in failures of electrical systems, thereby restricting their use in space. This study introduces a first-principle modeling framework that derives precise dynamic equations for Boost converters by incorporating non-ideal component coupling. As compared to the most accurate existing Boost converter model, the proposed models reduce steady-state and dynamic-state errors between experimental and simulated output voltages by factors of 11.0 (from 20.9% to 1.9%) and 15.4 (from 77.1% to 5.0%) under input voltage variations, and by factors of 10.2 (from 15.3% to 1.5%) and 35.1 (from 42.1% to 1.2%) under load changes, respectively. Consequently, a reliable Boost converter is accordingly designed and on-orbit deployed for precise energy conversions.

Submitted 8 September, 2025; originally announced September 2025.

Comments: 24 pages, 30 pages supplementary material, 5 figures, 14 supplementary figures, 6 supplementary tables

arXiv:2509.06413 [pdf, ps, other]

VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results

Authors: Yixiao Li, Xin Li, Chris Wei Zhou, Shuo Xing, Hadi Amirpour, Xiaoshuai Hao, Guanghui Yue, Baoquan Zhao, Weide Liu, Xiaoyuan Yang, Zhengzhong Tu, Xinyu Li, Chuanbiao Song, Chenqi Zhang, Jun Lan, Huijia Zhu, Weiqiang Wang, Xiaoyan Sun, Shishun Tian, Dongyang Yan, Weixia Zhang, Junlin Chen, Wei Sun, Zhihua Wang, Zhuohang Shi , et al. (6 additional authors not shown)

Abstract: This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: https://github.com/Lighting-YXLI/ISRGen-QA.

Submitted 8 September, 2025; originally announced September 2025.

Comments: 11 pages, 12 figures, VQualA ICCV Workshop

arXiv:2509.06170 [pdf, ps, other]

Pinching Antenna System (PASS) Enhanced Covert Communications: Against Warden via Sensing

Authors: Hao Jiang, Zhaolin Wang, Yuanwei Liu, Arumugam Nallanathan, Zhiguo Ding

Abstract: A sensing-aided covert communication network empowered by pinching antenna systems (PASS) is proposed in this work. Unlike conventional fixed-position MIMO arrays, PASS dynamically reconfigures its pinching antennas (PAs) closer to the legitimate user, substantially enhancing covertness. To further secure the adversary's channel state information (CSI), a sensing function is leveraged to track the malicious warden's movements. In particular, this paper first proposes an extended Kalman filter (EKF) based approach to fulfilling the tracking function. Building on this, a covert communication problem is formulated with a joint design of beamforming, artificial noise (AN) signals, and the position of PAs. Then, the beamforming and AN design subproblems are resolved jointly with a subspace approach, while the PA position optimization subproblem is handled by a deep reinforcement learning (DRL) approach by treating the evolution of the warden's mobility status as a temporally corrected process. Numerical results are presented and demonstrate that: i) the EKF approach can accurately track the warden's CSI with low complexity, ii) the effectiveness of the proposed solution is verified by its outperformance over the greedy and searching-based benchmarks, and iii) with new design degrees of freedom (DoFs), the performance of PASS is superior to the conventional fully-digital MIMO systems.

Submitted 7 September, 2025; originally announced September 2025.

Comments: Submit to possible IEEE journal

arXiv:2509.05971 [pdf, ps, other]

DeepStream: Prototyping Deep Joint Source-Channel Coding for Real-Time Multimedia Transmissions

Authors: Kaiyi Chi, Yinghui He, Qianqian Yang, Zhiping Jiang, Yuanchao Shu, Zhiqin Wang, Jun Luo, Jiming Chen

Abstract: Deep learning-based joint source-channel coding (DeepJSCC) has emerged as a promising technique in 6G for enhancing the efficiency and reliability of data transmission across diverse modalities, particularly in low signal-to-noise ratio (SNR) environments. This advantage is realized by leveraging powerful neural networks to learn an optimal end-to-end mapping from the source data directly to the transmit symbol sequence, eliminating the need for separate source coding, channel coding, and modulation. Although numerous efforts have been made towards efficient DeepJSCC, they have largely stayed at numerical simulations that can be far from practice, leaving the real-world viability of DeepJSCC largely unverified. To this end, we prototype DeepStream upon orthogonal frequency division multiplexing (OFDM) technology to offer efficient and robust DeepJSCC for multimedia transmission. In conforming to OFDM, we develop both a feature-to-symbol mapping method and a cross-subcarrier precoding method to improve the subcarrier independence and reduce peak-to-average power ratio. To reduce system complexity and enable flexibility in accommodating varying quality of service requirements, we further propose a progressive coding strategy that adjusts the compression ratio based on latency with minimal performance loss. We implement DeepStream for real-time image transmission and video streaming using software-defined radio. Extensive evaluations verify that DeepStream outperforms both the standard scheme and the direct deployment scheme. Particularly, at an SNR of 10 dB, DeepStream achieves a PSNR of 35 dB for image transmission and an MS-SSIM of 20 dB for video streaming, whereas the standard scheme fails to recover meaningful information.

Submitted 7 September, 2025; originally announced September 2025.

Comments: 13 pages, 43 figures

arXiv:2509.04870 [pdf, ps, other]

Multi-modal Uncertainty Robust Tree Cover Segmentation For High-Resolution Remote Sensing Images

Authors: Yuanyuan Gui, Wei Li, Yinjian Wang, Xiang-Gen Xia, Mauro Marty, Christian Ginzler, Zuyuan Wang

Abstract: Recent advances in semantic segmentation of multi-modal remote sensing images have significantly improved the accuracy of tree cover mapping, supporting applications in urban planning, forest monitoring, and ecological assessment. Integrating data from multiple modalities, such as optical imagery, light detection and ranging (LiDAR), and synthetic aperture radar (SAR), has shown superior performance over single-modality methods. However, these data are often acquired days or even months apart, during which various changes may occur, such as vegetation disturbances (e.g., logging and wildfires) and variations in imaging quality. Such temporal misalignments introduce cross-modal uncertainty, especially in high-resolution imagery, which can severely degrade segmentation accuracy. To address this challenge, we propose MURTreeFormer, a novel multi-modal segmentation framework that mitigates and leverages aleatoric uncertainty for robust tree cover mapping. MURTreeFormer treats one modality as primary and others as auxiliary, explicitly modeling patch-level uncertainty in the auxiliary modalities via a probabilistic latent representation. Uncertain patches are identified and reconstructed from the primary modality's distribution through a VAE-based resampling mechanism, producing enhanced auxiliary features for fusion. In the decoder, a gradient magnitude attention (GMA) module and a lightweight refinement head (RH) are further integrated to guide attention toward tree-like structures and to preserve fine-grained spatial details. Extensive experiments on multi-modal datasets from Shanghai and Zurich demonstrate that MURTreeFormer significantly improves segmentation performance and effectively reduces the impact of temporally induced aleatoric uncertainty.

Submitted 5 September, 2025; originally announced September 2025.

arXiv:2509.03421 [pdf]

Generalist versus Specialist Vision Foundation Models for Ocular Disease and Oculomics

Authors: Yukun Zhou, Paul Nderitu, Jocelyn Hui Lin Goh, Justin Engelmann, Siegfried K. Wagner, Anran Ran, Hongyang Jiang, Lie Ju, Ke Zou, Sahana Srinivasan, Hyunmin Kim, Takahiro Ninomiya, Zheyuan Wang, Gabriel Dawei Yang, Eden Ruffell, Dominic Williamson, Rui Santos, Gabor Mark Somfai, Carol Y. Cheung, Tien Yin Wong, Daniel C. Alexander, Yih Chung Tham, Pearse A. Keane

Abstract: Medical foundation models, pre-trained with large-scale clinical data, demonstrate strong performance in diverse clinically relevant applications. RETFound, trained on nearly one million retinal images, exemplifies this approach in applications with retinal images. However, the emergence of increasingly powerful and multifold larger generalist foundation models such as DINOv2 and DINOv3 raises the question of whether domain-specific pre-training remains essential, and if so, what gap persists. To investigate this, we systematically evaluated the adaptability of DINOv2 and DINOv3 in retinal image applications, compared to two specialist RETFound models, RETFound-MAE and RETFound-DINOv2. We assessed performance on ocular disease detection and systemic disease prediction using two adaptation strategies: fine-tuning and linear probing. Data efficiency and adaptation efficiency were further analysed to characterise trade-offs between predictive performance and computational cost. Our results show that although scaling generalist models yields strong adaptability across diverse tasks, RETFound-DINOv2 consistently outperforms these generalist foundation models in ocular-disease detection and oculomics tasks, demonstrating stronger generalisability and data efficiency. These findings suggest that specialist retinal foundation models remain the most effective choice for clinical applications, while the narrowing gap with generalist foundation models suggests that continued data and model scaling can deliver domain-relevant gains and position them as strong foundations for future medical foundation models.

Submitted 3 September, 2025; originally announced September 2025.

Comments: 39 pages, 8 Figures

ACM Class: J.3; I.2.10

arXiv:2509.02402 [pdf, ps, other]

autoPET IV challenge: Incorporating organ supervision and human guidance for lesion segmentation in PET/CT

Authors: Junwei Huang, Yingqi Hao, Yitong Luo, Ziyu Wang, Mingxuan Liu, Yifei Chen, Yuanhan Wang, Lei Xiang, Qiyuan Tian

Abstract: Lesion Segmentation in PET/CT scans is an essential part of modern oncological workflows. To address the challenges of time-intensive manual annotation and high inter-observer variability, the autoPET challenge series seeks to advance automated segmentation methods in complex multi-tracer and multi-center settings. Building on this foundation, autoPET IV introduces a human-in-the-loop scenario to efficiently utilize interactive human guidance in segmentation tasks. In this work, we incorporated tracer classification, organ supervision and simulated clicks guidance into the nnUNet Residual Encoder framework, forming an integrated pipeline that demonstrates robust performance in a fully automated (zero-guidance) context and efficiently leverages iterative interactions to progressively enhance segmentation accuracy.

Submitted 2 September, 2025; originally announced September 2025.

arXiv:2509.02116 [pdf, ps, other]

Affine-Doppler Division Multiplexing for High-Mobility Wireless Communications Systems

Authors: Yuanfang Ma, Zulin Wang, Peng Yuan, Qin Huang, Yuanhan Ni

Abstract: Affine Frequency Division Multiplexing (AFDM) has been regarded as a candidate integrated sensing and communications (ISAC) waveform owing to its superior communication performance, outperforming the Orthogonal Time-Frequency Space (OTFS) that has been researched for a longer time. However, since the above two waveforms are incompatible with each other, the state-of-the-art methods well-designed for OTFS may not be directly applicable to AFDM. This paper introduces a new orthogonal multicarrier waveform, namely Affine-Doppler Division Multiplexing (ADDM), which can provide a generic framework and subsume the existing OTFS and AFDM as a particular case. ADDM modulating information symbols in the Affine-Doppler (A-D) domain based on a two-dimensional (2D) transform can enjoy both excellent unambiguous Doppler and Doppler resolution, which is the same as AFDM but outperforms OTFS. Moreover, benefiting from the 2D transform, the symbols block of ADDM in the A-D domain undergoes a 2D cyclic shift produced by the delay and the Doppler of the channel, similar to the 2D cyclic shift in the delay-Doppler domain of cyclic prefix (CP)-OTFS. This offers a potential to directly apply the state-of-the-art methods well-designed for OTFS and AFDM to ADDM. Numerical results show that ADDM achieves comparable BER performance with AFDM but outperforms OTFS in high-mobility scenarios.

Submitted 4 September, 2025; v1 submitted 2 September, 2025; originally announced September 2025.

Comments: 7 pages, 4 figures, 1 table

arXiv:2509.01905 [pdf, ps, other]

Efficient River Water Level Sensing Using Cellular CSI and Joint Space-Time Processing

Authors: Khawaja Fahad Masood, Kai Wu, Zhongqin Wang, J. Andrew Zhang, Shu-Lin Chen, Y. Jay Guo

Abstract: Accurate and timely water level monitoring is critical for flood prevention, environmental management, and emerging smart infrastructure systems. Traditional water sensing methods often rely on dedicated sensors, which can be costly to deploy and difficult to maintain and are vulnerable to damage during floods. In this work, we propose a novel cellular signal-based sensing scheme that passively estimates water level changes using downlink mobile signals from existing communication infrastructure. By capturing subtle variations in channel state information (CSI), the proposed method estimates the length changes of the water-reflected signal path, which correspond to water level variations. A space-time processing framework is developed to jointly estimate the angle of arrival and Doppler shift, enabling isolation and enhancement of the water-reflected path via beamforming, while effectively suppressing environmental noise. The phase evolution of the beamformed signal is then extracted to infer water level changes. To address clock asynchronism between the transmitter and receiver inherent in bistatic systems, we introduce a beamforming-based compensation technique for removing time-varying random phase offsets in CSI. Field experiments conducted across a river demonstrate that the proposed method enables accurate and reliable water level estimation, achieving a mean accuracy ranging from 1.5 cm to 3.05 cm across different receiver configurations and deployments.

Submitted 1 September, 2025; originally announced September 2025.

Comments: 12 pages, 13 figures, submitted to an IEEE journal for possible publication

arXiv:2509.01217 [pdf, ps, other]

Learn2Reg 2024: New Benchmark Datasets Driving Progress on New Challenges

Authors: Lasse Hansen, Wiebke Heyer, Christoph Großbröhmer, Frederic Madesta, Thilo Sentker, Wang Jiazheng, Yuxi Zhang, Hang Zhang, Min Liu, Junyi Wang, Xi Zhu, Yuhua Li, Liwen Wang, Daniil Morozov, Nazim Haouchine, Joel Honkamaa, Pekka Marttinen, Yichao Zhou, Zuopeng Tan, Zhuoyuan Wang, Yi Wang, Hongchao Zhou, Shunbo Hu, Yi Zhang, Qian Tao , et al. (29 additional authors not shown)

Abstract: Medical image registration is critical for clinical applications, and fair benchmarking of different methods is essential for monitoring ongoing progress. To date, the Learn2Reg 2020-2023 challenges have released several complementary datasets and established metrics for evaluations. However, these editions did not capture all aspects of the registration problem, particularly in terms of modality diversity and task complexity. To address these limitations, the 2024 edition introduces three new tasks, including large-scale multi-modal registration and unsupervised inter-subject brain registration, as well as the first microscopy-focused benchmark within Learn2Reg. The new datasets also inspired new method developments, including invertibility constraints, pyramid features, keypoints alignment and instance optimisation.

Submitted 8 September, 2025; v1 submitted 1 September, 2025; originally announced September 2025.

Comments: submitted to MELBA Journal v2: added Jinming Duan to author list

arXiv:2509.00964 [pdf, ps, other]

Doubly-Dispersive Continuous MIMO Systems: Channel Modeling and Beamforming Design

Authors: Kuranage Roche Rayan Ranasinghe, Zhaolin Wang, Hyeon Seok Rou, Giuseppe Thadeu Freitas de Abreu, Emil Björnson

Abstract: We address the modeling and optimal beamforming (BF) design for multiple-input multiple-output (MIMO) continuous aperture array (CAPA) systems operating over doubly-dispersive (DD) channels. First, a comprehensive DD continuous MIMO (DDC MIMO) channel model that incorporates CAPAs at both the transmitter (TX) and receiver (RX) is derived, which is used to obtain explicit input-output (I/O) relations for various waveforms well suited to integrated sensing and communications (ISAC) and robust to DD channels, namely orthogonal frequency division multiplexing (OFDM), orthogonal time frequency space (OTFS), and affine frequency division multiplexing (AFDM). Then, functional optimization problems are formulated for the design of TX and RX BF matrices that maximize received power, in which novel low-complexity, closed-form solutions are obtained via the calculus of variations (CoV) method, yielding expressions closely related to the classical matched filter commonly used in conventional MIMO systems. Simulation results confirm that the proposed TX/RX BF designs with CAPAs provide significant performance and computational complexity gains over conventional MIMO systems in DD channels.

Submitted 4 September, 2025; v1 submitted 31 August, 2025; originally announced September 2025.

Comments: Submitted to IEEE Transactions on Wireless Communications

arXiv:2509.00314 [pdf, ps, other]

CoMET: A Contrastive-Masked Brain Foundation Model for Universal EEG Representation

Authors: Ang Li, Zikai Wang, Liuyin Yang, Zhenyu Wang, Tianheng Xu, Honglin Hu, Marc M. Van Hulle

Abstract: Electroencephalography (EEG) is a non-invasive technique for recording brain activity, widely used in brain-computer interfaces, clinic, and healthcare. Traditional EEG deep models typically focus on specific dataset and task, limiting model size and generalization. Recently, self-supervised brain foundation models have emerged and been applied to various downstream tasks. Nevertheless, these models still have limitations: current SOTA models typically rely on masked reconstruction strategy; however, EEG features of adjacent channels are highly correlated, which causes the pre-training to overly focus on low-dimensional signal-similarity features in local regions and neglect the global discriminative patterns vital for downstream tasks. To address these limitations, we propose a brain foundation model called CoMET. Specifically, we employ the masked autoencoder with redesigned patching and embedding for EEG as backbone and devise a novel contrastive learning framework with mirror-scale augmentation to strengthen the global discrimination ability. CoMET is pre-trained on mixed EEG datasets over 3000 subjects with over one million samples. It is evaluated on ten different downstream datasets, and the SOTA results demonstrate CoMET's superior ability in extracting universal EEG representations and strong clinical potential.

Submitted 29 August, 2025; originally announced September 2025.

arXiv:2508.20288 [pdf, ps, other]

Neural Spline Operators for Risk Quantification in Stochastic Systems

Authors: Zhuoyuan Wang, Raffaele Romagnoli, Kamyar Azizzadenesheli, Yorie Nakahira

Abstract: Accurately quantifying long-term risk probabilities in diverse stochastic systems is essential for safety-critical control. However, existing sampling-based and partial differential equation (PDE)-based methods often struggle to handle complex varying dynamics. Physics-informed neural networks learn surrogate mappings for risk probabilities from varying system parameters of fixed and finite dimensions, yet can not account for functional variations in system dynamics. To address these challenges, we introduce physics-informed neural operator (PINO) methods to risk quantification problems, to learn mappings from varying \textit{functional} system dynamics to corresponding risk probabilities. Specifically, we propose Neural Spline Operators (NeSO), a PINO framework that leverages B-spline representations to improve training efficiency and achieve better initial and boundary condition enforcements, which are crucial for accurate risk quantification. We provide theoretical analysis demonstrating the universal approximation capability of NeSO. We also present two case studies, one with varying functional dynamics and another with high-dimensional multi-agent dynamics, to demonstrate the efficacy of NeSO and its significant online speed-up over existing methods. The proposed framework and the accompanying universal approximation theorem are expected to be beneficial for other control or PDE-related problems beyond risk quantification.

Submitted 27 August, 2025; originally announced August 2025.

arXiv:2508.20141 [pdf]

UltraEar: a multicentric, large-scale database combining ultra-high-resolution computed tomography and clinical data for ear diseases

Authors: Ruowei Tang, Pengfei Zhao, Xiaoguang Li, Ning Xu, Yue Cheng, Mengshi Zhang, Zhixiang Wang, Zhengyu Zhang, Hongxia Yin, Heyu Ding, Shusheng Gong, Yuhe Liu, Zhenchang Wang

Abstract: Ear diseases affect billions of people worldwide, leading to substantial health and socioeconomic burdens. Computed tomography (CT) plays a pivotal role in accurate diagnosis, treatment planning, and outcome evaluation. The objective of this study is to present the establishment and design of UltraEar Database, a large-scale, multicentric repository of isotropic 0.1 mm ultra-high-resolution CT (U-HRCT) images and associated clinical data dedicated to ear diseases. UltraEar recruits patients from 11 tertiary hospitals between October 2020 and October 2035, integrating U-HRCT images, structured CT reports, and comprehensive clinical information, including demographics, audiometric profiles, surgical records, and pathological findings. A broad spectrum of otologic disorders is covered, such as otitis media, cholesteatoma, ossicular chain malformation, temporal bone fracture, inner ear malformation, cochlear aperture stenosis, enlarged vestibular aqueduct, and sigmoid sinus bony deficiency. Standardized preprocessing pipelines have been developed for geometric calibration, image annotation, and multi-structure segmentation. All personal identifiers in DICOM headers and metadata are removed or anonymized to ensure compliance with data privacy regulation. Data collection and curation are coordinated through monthly expert panel meetings, with secure storage on an offline cloud system. UltraEar provides an unprecedented ultra-high-resolution reference atlas with both technical fidelity and clinical relevance. This resource has significant potential to advance radiological research, enable development and validation of AI algorithms, serve as an educational tool for training in otologic imaging, and support multi-institutional collaborative studies. UltraEar will be continuously updated and expanded, ensuring long-term accessibility and usability for the global otologic research community.

Submitted 27 August, 2025; originally announced August 2025.

arXiv:2508.19644 [pdf, ps, other]

Low-Cost Architecture and Efficient Pattern Synthesis for Polarimetric Phased Array Based on Polarization Coding Reconfigurable Elements

Authors: Yiqing Wang, Jian Zhou, Chen Pang, Wenyang Man, Zixiang Xiong, Ke Meng, Zhanling Wang, Yongzhen Li

Abstract: Polarimetric phased arrays (PPAs) enhance radar target detection and anti-jamming capabilities. However, the dual transmit/receive (T/R) channel requirement leads to high costs and system complexity. To address this, this paper introduces a polarization-coding reconfigurable phased array (PCRPA) and associated pattern synthesis techniques to reduce PPA costs while minimizing performance degradation. Each PCRPA element connects to a single T/R channel and incorporates two-level RF switches for real-time control of polarization states and waveforms. By adjusting element codes and excitation weights, the PCRPA can generate arbitrarily polarized and dual-polarized beams. Efficient beam pattern synthesis methods are also proposed, featuring novel optimization constraints derived from theoretical and analytical analysis of PCRPAs. Simulations demonstrate that the approach achieves low cross-polarization and sidelobe levels comparable to conventional architectures within the scan range, particularly for large arrays. However, the channel reduction inevitably incurs power and directivity loss. Experiments conducted on an $8\times 8$ X-band array antenna validate the effectiveness of the proposed system. The PCRPA and synthesis methods are well-suited for large-scale PPA systems, offering significant cost-effectiveness while maintaining good sidelobe suppression and polarization control performance.

Submitted 28 August, 2025; v1 submitted 27 August, 2025; originally announced August 2025.

arXiv:2508.19205 [pdf, ps, other]

VibeVoice Technical Report

Authors: Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei

Abstract: This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational ``vibe'' and surpassing open-source and proprietary dialogue models.

Submitted 26 August, 2025; originally announced August 2025.

arXiv:2508.18653 [pdf, ps, other]

The Sound of Risk: A Multimodal Physics-Informed Acoustic Model for Forecasting Market Volatility and Enhancing Market Interpretability

Authors: Xiaoliang Chen, Xin Yu, Le Chang, Teng Jing, Jiashuai He, Ze Wang, Yangjun Luo, Xingyu Chen, Jiayue Liang, Yuchen Wang, Jiaying Xie

Abstract: Information asymmetry in financial markets, often amplified by strategically crafted corporate narratives, undermines the effectiveness of conventional textual analysis. We propose a novel multimodal framework for financial risk assessment that integrates textual sentiment with paralinguistic cues derived from executive vocal tract dynamics in earnings calls. Central to this framework is the Physics-Informed Acoustic Model (PIAM), which applies nonlinear acoustics to robustly extract emotional signatures from raw teleconference sound subject to distortions such as signal clipping. Both acoustic and textual emotional states are projected onto an interpretable three-dimensional Affective State Label (ASL) space-Tension, Stability, and Arousal. Using a dataset of 1,795 earnings calls (approximately 1,800 hours), we construct features capturing dynamic shifts in executive affect between scripted presentation and spontaneous Q&A exchanges. Our key finding reveals a pronounced divergence in predictive capacity: while multimodal features do not forecast directional stock returns, they explain up to 43.8% of the out-of-sample variance in 30-day realized volatility. Importantly, volatility predictions are strongly driven by emotional dynamics during executive transitions from scripted to spontaneous speech, particularly reduced textual stability and heightened acoustic instability from CFOs, and significant arousal variability from CEOs. An ablation study confirms that our multimodal approach substantially outperforms a financials-only baseline, underscoring the complementary contributions of acoustic and textual modalities. By decoding latent markers of uncertainty from verifiable biometric signals, our methodology provides investors and regulators a powerful tool for enhancing market interpretability and identifying hidden corporate uncertainty.

Submitted 25 August, 2025; originally announced August 2025.

Comments: 9 pages, 6 figures

MSC Class: 62P05; 68T0

ACM Class: I.2.7; J.4

arXiv:2508.18295 [pdf, ps, other]

H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems

Authors: Huangyu Dai, Lingtao Mao, Ben Chen, Zihan Wang, Zihan Liang, Ying Han, Chenyi Lei, Han Li

Abstract: Hotword customization is crucial in ASR to enhance the accuracy of domain-specific terms. It has been primarily driven by the advancements in traditional models and Audio large language models (LLMs). However, existing models often struggle with large-scale hotwords, as the recognition rate drops dramatically with the number of hotwords increasing. In this paper, we introduce a novel hotword customization system that utilizes a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidate by measuring the acoustic similarity between the hotwords and the speech segment. This plug-and-play solution can be easily integrated into traditional models such as SeACo-Paraformer, significantly enhancing hotwords post-recall rate (PRR). Additionally, we incorporate H-PRM into Audio LLMs through a prompt-based approach, enabling seamless customization of hotwords. Extensive testing validates that H-PRM can outperform existing methods, showing a new direction for hotword customization in ASR.

Submitted 22 August, 2025; originally announced August 2025.

arXiv:2508.16569 [pdf, ps, other]

A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer

Authors: Yuhui Tao, Zhongwei Zhao, Zilong Wang, Xufang Luo, Feng Chen, Kang Wang, Chuanfu Wu, Xue Zhang, Shaoting Zhang, Jiaxi Yao, Xingwei Jin, Xinyang Jiang, Yifan Yang, Dongsheng Li, Lili Qiu, Zhiqiang Shao, Jianming Guo, Nengwang Yu, Shuo Wang, Ying Xiong

Abstract: The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort, a visual-language foundation model for characterization, diagnosis and prognosis of renal mass. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge before aligning them through a contrastive learning objective, to create robust representations for superior generalization and diagnostic precision. RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction, compared with other state-of-the-art general-purpose CT foundation models. Especially, for complicated task like recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, representing a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP's pre-training imparted remarkable data efficiency; in the diagnostic classification task, it only needs 20% training data to achieve the peak performance of all baseline models even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.

Submitted 22 August, 2025; originally announced August 2025.

arXiv:2508.13479 [pdf, ps, other]

AIM 2025 challenge on Inverse Tone Mapping Report: Methods and Results

Authors: Chao Wang, Francesco Banterle, Bin Ren, Radu Timofte, Xin Lu, Yufeng Peng, Chengjie Ge, Zhijing Sun, Ziang Zhou, Zihao Li, Zishun Liao, Qiyu Kang, Xueyang Fu, Zheng-Jun Zha, Zhijing Sun, Xingbo Wang, Kean Liu, Senyan Xu, Yang Qiu, Yifan Ding, Gabriel Eilertsen, Jonas Unger, Zihao Wang, Ke Wu, Jinshan Pan, et al. (4 additional authors not shown)

Abstract: This paper presents a comprehensive review of the AIM 2025 Challenge on Inverse Tone Mapping (ITM). The challenge aimed to push forward the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs, focusing on perceptual fidelity and numerical consistency. A total of \textbf{67} participants submitted \textbf{319} valid results, from which the best five teams were selected for detailed analysis. This report consolidates their methodologies and performance, with the lowest PU21-PSNR among the top entries reaching 29.22 dB. The analysis highlights innovative strategies for enhancing HDR reconstruction quality and establishes strong benchmarks to guide future research in inverse tone mapping.

Submitted 21 September, 2025; v1 submitted 18 August, 2025; originally announced August 2025.

arXiv:2508.13306 [pdf, ps, other]

Stochastic Black Start Resource Allocation to Enable Dynamic Formation of Networked Microgrids and DER-aided Restoration

Authors: Cong Bai, Salish Maharjan, Han Wang, Zhaoyu Wang

Abstract: Extended outages in distributed systems (DSs) dominated by distributed energy resources (DERs) require innovative strategies to efficiently and securely deploy black start (BS) resources. To address the need, this paper proposes a two-stage stochastic resource allocation method within synchronizing dynamic microgrids (MGs) for black start (SDMG-BS), enabling risk-averse and adaptive restoration across various scenarios while ensuring frequency security. Virtual synchronous generator (VSG)-controlled grid-forming inverters (GFMIs) equipped with primary frequency governors (PFGs) are modeled as BS resources. Their frequency response is characterized by three transient indices, which are deployed as frequency dynamic constraints on load pick-up events to ensure frequency stability during the BS process. SDMG-BS framework facilitates location-independent synchronization among restored MGs and with the transmission grid (TG) with the help of smart switches (SSWs). The model incorporates scenario-based stochastic programming to address multi-source uncertainties, including season-dependent operational conditions and unpredictable TG outage durations, ensuring a resilient allocation plan. The proposed approach is validated on a modified IEEE 123-node feeder with three study cases designed across sixteen uncertainty scenarios.