-
SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement
Authors:
Xingchen Li,
Hanke Xie,
Ziqian Wang,
Zihan Zhang,
Longshuai Xiao,
Lei Xie
Abstract:
Generative universal speech enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. Diffusion- or flow-based generative models are capable of producing enhanced speech with high quality and fidelity. However, they typically achieve speech enhancement by learning an acoustic feature mapping from degraded speech to clean speech, while lacking awareness of high-level semantic information. This deficiency tends to cause semantic ambiguity and acoustic discontinuities in the enhanced speech. In contrast, humans can often comprehend heavily corrupted speech by relying on semantic priors, suggesting that semantics play a crucial role in speech enhancement. Therefore, in this paper, we propose SenSE, which leverages a language model to capture the semantic information of distorted speech and effectively integrates it into a flow-matching-based speech enhancement framework. Specifically, we introduce a semantic-aware speech language model to capture the semantics of degraded speech and generate semantic tokens. We then design a semantic guidance mechanism that incorporates semantic information into the flow-matching-based speech enhancement process, effectively mitigating semantic ambiguity. In addition, we propose a prompt guidance mechanism, which leverages a short reference utterance to alleviate the loss of speaker similarity under severe distortion conditions. Results on several benchmark datasets demonstrate that SenSE not only ensures high perceptual quality but also substantially improves speech fidelity while maintaining strong robustness under severe distortions. Code and demos are available.
Submitted 29 September, 2025;
originally announced September 2025.
-
PhysiAgent: An Embodied Agent Framework in Physical World
Authors:
Zhihao Wang,
Jianxiong Li,
Jinliang Zheng,
Wencong Zhang,
Dongxiu Liu,
Yinan Zheng,
Haoyi Niu,
Junzhi Yu,
Xianyuan Zhan
Abstract:
Vision-Language-Action (VLA) models have achieved notable success but often struggle with limited generalization. To address this, integrating generalized Vision-Language Models (VLMs) as assistants to VLAs has emerged as a popular solution. However, current approaches often combine these models in rigid, sequential structures: using VLMs primarily for high-level scene understanding and task planning, and VLAs merely as executors of lower-level actions, leading to ineffective collaboration and poor grounding. In this paper, we propose an embodied agent framework, PhysiAgent, tailored to operate effectively in physical environments. By incorporating monitor, memory, self-reflection mechanisms, and lightweight off-the-shelf toolboxes, PhysiAgent offers an autonomous scaffolding framework to prompt VLMs to organize different components based on real-time proficiency feedback from VLAs to maximally exploit VLAs' capabilities. Experimental results demonstrate significant improvements in task-solving performance on complex real-world robotic tasks, showcasing effective self-regulation of VLMs, coherent tool collaboration, and adaptive evolution of the framework during execution. PhysiAgent makes practical and pioneering efforts to integrate VLMs and VLAs, effectively grounding embodied agent frameworks in real-world settings.
Submitted 29 September, 2025;
originally announced September 2025.
-
MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow
Authors:
Yike Zhu,
Boyi Kang,
Ziqian Wang,
Xingchen Li,
Zihan Zhang,
Wenjie Li,
Longshuai Xiao,
Wei Xue,
Lei Xie
Abstract:
Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion/flow-matching) or large language models, limiting real-time deployment. To mitigate these constraints, we present MeanFlowSE, a one-step generative SE framework. It adopts MeanFlow to predict an average-velocity field for one-step latent refinement and conditions the model on self-supervised learning (SSL) representations rather than VAE latents. This design accelerates inference and provides robust acoustic-semantic guidance during training. On the Interspeech 2020 DNS Challenge blind test set and simulated test set, MeanFlowSE attains state-of-the-art (SOTA) level perceptual quality and competitive intelligibility while significantly lowering both real-time factor (RTF) and model size compared with recent generative competitors, making it suitable for practical use. The code will be released upon publication at https://github.com/Hello3orld/MeanFlowSE.
Submitted 30 September, 2025; v1 submitted 27 September, 2025;
originally announced September 2025.
-
Vision-Intelligence-Enabled Beam Tracking for Cross-Interface Water-Air Optical Wireless Communications
Authors:
Tianqi Mao,
Jiayue Liu,
Weijie Liu,
Dezhi Zheng,
Zhaocheng Wang
Abstract:
The escalating development of oceanic applications, such as underwater surveillance and mineral exploration, is motivating real-time wireless backhaul of the considerable observation data. Such prospects can hardly be realized by the narrowband acoustic approach. Alternatively, optical wireless communication (OWC) has emerged as a promising solution for maritime and underwater applications due to its great potential for broadband underwater transmission. However, the implementation of water-air OWC can be rather challenging, especially when penetrating the fluctuating interface, where the direction of refracted signals changes dynamically, causing severe beam misalignment with airborne stations. This has necessitated real-time transceiver alignment adaptable to the sophisticated oceanic environment, which has yet to be addressed. Against this background, this paper establishes a mathematical channel model for water-air optical wireless transmission across the fluctuating sea surface. Based on the model, we propose a vision-based beam tracking algorithm that leverages artificial intelligence (AI) methods for dynamic channel prediction. The proposed algorithm integrates a convolutional neural network (CNN) with bi-directional long short-term memory (Bi-LSTM), which further incorporates the attention mechanism to effectively extract critical spatio-temporal features from the vision data. The numerical simulation results show that the proposed algorithm can outperform its classical counterparts in maintaining received signal strength and suppressing vision noise, which demonstrates its robustness against the harsh conditions of water-air OWC systems.
Submitted 25 September, 2025;
originally announced September 2025.
-
Neural Integrated Sensing and Communication for the MIMO-OFDM Downlink
Authors:
Ziyi Wang,
Frederik Zumegen,
Christoph Studer
Abstract:
The ongoing convergence of spectrum and hardware requirements for wireless sensing and communication applications has fueled the integrated sensing and communication (ISAC) paradigm in next-generation networks. Neural-network-based ISAC leverages data-driven learning techniques to add sensing capabilities to existing communication infrastructure. This paper presents a novel signal-processing framework for such neural ISAC systems based on the multiple-input multiple-output (MIMO) and orthogonal frequency-division multiplexing (OFDM) downlink. Our approach enables generalized sensing functionality without modifying the MIMO-OFDM communication link. Specifically, our neural ISAC pipeline measures the backscattered communication signals to generate discrete map representations of spatial occupancy, formulated as multiclass or multilabel classification problems, which can then be utilized by specialized downstream tasks. To improve sensing performance in closed or cluttered environments, our neural ISAC pipeline relies on features specifically designed to mitigate strong reflective paths. Extensive simulations using ray-tracing models demonstrate that our neural ISAC framework reliably reconstructs scene maps without altering the MIMO-OFDM communication pipeline or reducing data rates.
Submitted 25 September, 2025;
originally announced September 2025.
-
Multi-Stage CD-Kennedy Receiver for QPSK Modulated CV-QKD in Turbulent Channels
Authors:
Renzhi Yuan,
Zhixing Wang,
Shouye Miao,
Mufei Zhao,
Haifeng Yao,
Bin Cao,
Mugen Peng
Abstract:
Continuous variable-quantum key distribution (CV-QKD) protocols have attracted increasing attention in recent years because they enjoy a high secret key rate (SKR) and good compatibility with existing optical communication infrastructure. Classical coherent receivers are widely employed in coherent-state-based CV-QKD protocols, whose detection performance is bounded by the standard quantum limit (SQL). Recently, quantum receivers based on displacement operators have been experimentally demonstrated with detection performance outperforming the SQL in various practical conditions. However, potential applications of quantum receivers in CV-QKD protocols under turbulent channels are still not well explored, while practical CV-QKD protocols must survive atmospheric turbulence in satellite-to-ground optical communication links. In this paper, we consider the possibility of using a quantum receiver called the multi-stage CD-Kennedy receiver to enhance the SKR performance of a quadrature phase shift keying (QPSK) modulated CV-QKD protocol in turbulent channels. We first derive the error probability of the multi-stage CD-Kennedy receiver for detecting QPSK signals in turbulent channels and further propose three types of multi-stage CD-Kennedy receiver with different displacement choices, i.e., the Type-I, Type-II, and Type-III receivers. Then we derive the SKR of a QPSK modulated CV-QKD protocol using the multi-stage CD-Kennedy receiver and post-selection strategy in turbulent channels. Numerical results show that the multi-stage CD-Kennedy receiver can outperform the classical coherent receiver in turbulent channels in terms of both error probability and SKR performance, and that the Type-II receiver can tolerate worse channel conditions than the Type-I and Type-III receivers in terms of error probability performance.
Submitted 24 September, 2025;
originally announced September 2025.
-
Timeliness-Aware Joint Source and Channel Coding for Adaptive Image Transmission
Authors:
Xiaolei Yang,
Zijing Wang,
Zhijin Qin,
Xiaoming Tao
Abstract:
Accurate and timely image transmission is critical for emerging time-sensitive applications such as remote sensing in satellite-assisted Internet of Things. However, the bandwidth limitation poses a significant challenge in existing wireless systems, making it difficult to fulfill the requirements of both high-fidelity and low-latency image transmission. Semantic communication is expected to break through the performance bottleneck by focusing on the transmission of goal-oriented semantic information rather than raw data. In this paper, we employ a new timeliness metric named the value of information (VoI) and propose an adaptive joint source and channel coding (JSCC) method for image transmission that simultaneously considers both reconstruction quality and timeliness. Specifically, we first design a JSCC framework for image transmission with adaptive code length. Next, we formulate a VoI maximization problem by optimizing the transmission code length of the adaptive JSCC under the reconstruction quality constraint. Then, a deep reinforcement learning-based algorithm is proposed to solve the optimization problem efficiently. Experimental results show that the proposed method significantly outperforms baseline schemes in terms of reconstruction quality and timeliness, particularly in low signal-to-noise ratio conditions, offering a promising solution for efficient and robust image transmission in time-sensitive wireless networks.
Submitted 24 September, 2025;
originally announced September 2025.
-
VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation
Authors:
Neel P. Bhatt,
Yunhao Yang,
Rohan Siva,
Pranay Samineni,
Daniel Milan,
Zhangyang Wang,
Ufuk Topcu
Abstract:
Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/.
Submitted 22 September, 2025;
originally announced September 2025.
-
A Secure Affine Frequency Division Multiplexing for Wireless Communication Systems
Authors:
Ping Wang,
Zulin Wang,
Yuanfang Ma,
Xiaosi Tian,
Yuanhan Ni
Abstract:
This paper introduces a secure affine frequency division multiplexing (SE-AFDM) waveform for wireless communication systems to enhance communication security. Besides configuring the parameter c1 to achieve communication reliability under doubly selective channels, we also utilize the time-varying parameter c2 to improve the security of the communication system. The derived input-output relation shows that the legitimate receiver can eliminate the nonlinear impact introduced by the time-varying c2 without losing bit error rate (BER) performance. Moreover, it is theoretically proved that the eavesdropper cannot separate the time-varying c2 and random information symbols, such that the BER performance of the eavesdropper is severely deteriorated. Meanwhile, the analysis of the effective signal-to-interference-plus-noise ratio (SINR) of the eavesdropper illustrates that the SINR decreases as the value range of c2 expands. Numerical results verify that the proposed SE-AFDM waveform provides strong security while maintaining good BER performance in high-mobility scenarios.
Submitted 22 September, 2025;
originally announced September 2025.
-
A Chain-of-thought Reasoning Breast Ultrasound Dataset Covering All Histopathology Categories
Authors:
Haojun Yu,
Youcheng Li,
Zihan Niu,
Nan Zhang,
Xuantong Gong,
Huan Li,
Zhiying Zou,
Haifeng Qi,
Zhenxiao Cao,
Zijie Lan,
Xingjian Yuan,
Jiating He,
Haokai Zhang,
Shengtao Zhang,
Zicheng Wang,
Dong Wang,
Ziwei Zhao,
Congying Chen,
Yong Wang,
Wangyan Qin,
Qingli Zhu,
Liwei Wang
Abstract:
Breast ultrasound (BUS) is an essential tool for diagnosing breast lesions, with millions of examinations per year. However, publicly available high-quality BUS benchmarks for AI development are limited in data scale and annotation richness. In this work, we present BUS-CoT, a BUS dataset for chain-of-thought (CoT) reasoning analysis, which contains 11,439 images of 10,019 lesions from 4,838 patients and covers all 99 histopathology types. To facilitate research on incentivizing CoT reasoning, we construct the reasoning processes based on observation, feature, diagnosis and pathology labels, annotated and verified by experienced experts. Moreover, by covering lesions of all histopathology types, we aim to facilitate robust AI systems in rare cases, which can be error-prone in clinical practice.
Submitted 22 September, 2025; v1 submitted 21 September, 2025;
originally announced September 2025.
-
A Reliable Robot Motion Planner in Complex Real-world Environments via Action Imagination
Authors:
Chengjin Wang,
Yanmin Zhou,
Zhipeng Wang,
Zheng Yan,
Feng Luan,
Shuo Jiang,
Runjie Shen,
Hongrui Sang,
Bin He
Abstract:
Humans and animals can make real-time adjustments to movements by imagining their action outcomes to prevent unanticipated or even catastrophic motion failures in unknown unstructured environments. Action imagination, as a refined sensorimotor strategy, leverages perception-action loops to handle physical interaction-induced uncertainties in perception and system modeling within complex systems. Inspired by the action-awareness capability of animal intelligence, this study proposes an imagination-inspired motion planner (I-MP) framework that specifically enhances robots' action reliability by imagining plausible spatial states to approach. After topologizing the workspace, I-MP builds a perception-action loop that enables robots to autonomously build contact models. Leveraging fixed-point theory and Hausdorff distance, the planner computes convergent spatial states under interaction characteristics and mission constraints. By homogeneously representing multi-dimensional environmental characteristics through work, the robot can approach the imagined spatial states via real-time computation of energy gradients. Experimental results demonstrate the practicality and robustness of I-MP in complex cluttered environments.
Submitted 21 September, 2025;
originally announced September 2025.
-
A Unified Distributed Algorithm for Hybrid Near-Far Field Activity Detection in Cell-Free Massive MIMO
Authors:
Jingreng Lei,
Yang Li,
Ziyue Wang,
Qingfeng Lin,
Ya-Feng Liu,
Yik-Chung Wu
Abstract:
A great deal of effort has recently been devoted to activity detection for massive machine-type communications in cell-free multiple-input multiple-output (MIMO) systems. However, as the number of antennas at the access points (APs) increases, the Rayleigh distance that separates the near-field and far-field regions also expands, rendering the conventional assumption of far-field propagation alone impractical. To address this challenge, this paper establishes a covariance-based formulation that can effectively capture the statistical property of hybrid near-far field channels. Based on this formulation, we theoretically reveal that increasing the proportion of near-field channels enhances the detection performance. Furthermore, we propose a distributed algorithm, where each AP performs local activity detection and only exchanges the detection results with the central processing unit, thus significantly reducing the computational complexity and the communication overhead. In addition to having a convergence guarantee, the proposed algorithm is unified in the sense that it can handle single-cell or cell-free systems with either near-field or far-field devices as special cases. Simulation results validate the theoretical analyses and demonstrate the superior performance of the proposed approach compared with existing methods.
Submitted 18 September, 2025;
originally announced September 2025.
-
Scaling green hydrogen and CCUS via cement-methanol co-production in China
Authors:
Yuezhang He,
Hongxi Luo,
Yuancheng Lin,
Carl J. Talsma,
Anna Li,
Zhenqian Wang,
Yujuan Fang,
Pei Liu,
Jesse D. Jenkins,
Eric Larson,
Zheng Li
Abstract:
High costs of green hydrogen and of carbon capture, utilization, and sequestration (CCUS) have hindered policy ambition and slowed real-world deployment, despite their importance for decarbonizing hard-to-abate sectors, including cement and methanol. Given the economic challenges of adopting CCUS in cement and green hydrogen in methanol production separately, we propose a renewable-powered co-production system that couples electrolytic hydrogen and CCUS through molecule exchange. We optimize system configurations using an hourly-resolved, process-based model incorporating operational flexibility, and explore integrated strategies for plant-level deployment and CO2 source-sink matching across China. We find that co-production could reduce CO2 abatement costs to USD 41-53 per tonne by 2035, significantly lower than approximately USD 75 for standalone cement CCUS and over USD 120 for standalone renewable-based methanol. Co-production is preferentially deployed at cement plants in renewable-rich regions, potentially reshaping national CO2 infrastructure planning. This hydrogen-CCUS coupling paradigm could accelerate industrial decarbonization and scaling for other applications.
Submitted 17 September, 2025;
originally announced September 2025.
-
Assessing Data Replication in Symbolic Music via Adapted Structural Similarity Index Measure
Authors:
Shulei Ji,
Zihao Wang,
Le Ma,
Jiaxing Yu,
Kejun Zhang
Abstract:
AI-generated music may inadvertently replicate samples from the training data, raising concerns of plagiarism. Similarity measures can quantify such replication, thereby offering supervision and guidance for music generation models. Existing similarity measures for symbolic music mainly target melody repetition, leaving a gap in assessing complex music with rich textures and expressive performance characteristics. To address this gap, we introduce SSIMuse, the first adaptation of the Structural Similarity Index Measure (SSIM) from images to symbolic music. Specifically, we represent symbolic music as image-like piano rolls in binary and velocity-based forms. Building upon these representations, we reinterpret and suitably modify the SSIM components in the musical context to develop two variants, i.e., SSIMuse-B and SSIMuse-V, for evaluating data replication in composition and dynamic performance, respectively. Controlled experiments on synthetic samples from multiple datasets show that SSIMuse can reliably detect exact replication at a granularity of at least one bar. SSIMuse enables open evaluation of replication in music generation and draws attention to its broader ethical, social, legal, and economic implications. The code is available at https://github.com/Tayjsl97/SSIMuse.
Submitted 16 September, 2025;
originally announced September 2025.
-
NEFT: A Unified Transformer Framework for Efficient Near-Field CSI Feedback in XL-MIMO Systems
Authors:
Haiyang Li,
Tianqi Mao,
Pengyu Wang,
Ruiqi Liu,
Shunyu Li,
Zhaocheng Wang
Abstract:
Extremely large-scale multiple-input multiple-output (XL-MIMO) systems, operating in the near-field region due to their massive antenna arrays, are a key enabler of next-generation wireless communications but face significant challenges in channel state information (CSI) feedback. Deep learning has emerged as a powerful tool by learning compact CSI representations for feedback. However, existing methods struggle to capture the intricate structure of near-field CSI while incurring prohibitive computational overhead on practical mobile devices. To overcome these limitations, we propose the Near-Field Efficient Feedback Transformer (NEFT) family for accurate and efficient near-field CSI feedback across diverse hardware platforms. Built on a hierarchical Vision Transformer backbone, NEFT is extended with lightweight variants to meet various deployment constraints: NEFT-Compact applies multi-level knowledge distillation (KD) to reduce complexity while maintaining accuracy, and NEFT-Hybrid and NEFT-Edge address encoder- and edge-constrained scenarios via attention-free encoding and KD. Extensive simulations show that NEFT achieves a 15-21 dB improvement in normalized mean-squared error (NMSE) over state-of-the-art methods, while NEFT-Compact and NEFT-Edge reduce total FLOPs by 25-36% with negligible accuracy loss. Moreover, NEFT-Hybrid lowers encoder-side complexity by up to 64%, enabling deployment in highly asymmetric device scenarios. These results establish NEFT as a practical and scalable solution for near-field CSI feedback in XL-MIMO systems.
Submitted 16 September, 2025;
originally announced September 2025.
-
PaiP: An Operational Aware Interactive Planner for Unknown Cabinet Environments
Authors:
Chengjin Wang,
Zheng Yan,
Yanmin Zhou,
Runjie Shen,
Zhipeng Wang,
Bin Cheng,
Bin He
Abstract:
Box/cabinet scenarios with stacked objects pose significant challenges for robotic motion due to visual occlusions and constrained free space. Traditional collision-free trajectory planning methods often fail when no collision-free paths exist, and may even lead to catastrophic collisions caused by invisible objects. To overcome these challenges, we propose an operational aware interactive motion planner (PaiP), a real-time closed-loop planning framework utilizing multimodal tactile perception. This framework autonomously infers object interaction features by perceiving motion effects at interaction interfaces. These interaction features are incorporated into grid maps to generate operational cost maps. Building upon this representation, we extend sampling-based planning methods to interactive planning by optimizing both path cost and operational cost. Experimental results demonstrate that PaiP achieves robust motion in narrow spaces.
Submitted 14 September, 2025;
originally announced September 2025.
-
Uplink and Downlink Communications in Segmented Waveguide-Enabled Pinching-Antenna Systems (SWANs)
Authors:
Chongjun Ouyang,
Hao Jiang,
Zhaolin Wang,
Yuanwei Liu,
Zhiguo Ding
Abstract:
A segmented waveguide-enabled pinching-antenna system (SWAN) is proposed, in which a segmented waveguide composed of multiple short dielectric waveguide segments is employed to radiate or receive signals through the pinching antennas (PAs) deployed on each segment. Based on this architecture, three practical operating protocols are proposed: segment selection (SS), segment aggregation (SA), and segment multiplexing (SM). For uplink SWAN communications, where one PA is activated per segment, the segmented structure eliminates the inter-antenna radiation effect, i.e., signals captured by one PA may re-radiate through other PAs along the same waveguide. This yields a tractable and physically consistent uplink signal model for a multi-PA pinching-antenna system (PASS), which has not been established for conventional PASS using a single long waveguide. Building on this model, PA placement algorithms are proposed to maximize the uplink signal-to-noise ratio (SNR). Closed-form expressions for the received SNR under the three protocols are derived, and the corresponding scaling laws with respect to the number of segments are analyzed. It is proven that the segmented architecture reduces both the average PA-to-user distance and the PA-to-feed distance, thereby mitigating both large-scale path loss and in-waveguide propagation loss. These results are extended to downlink SWAN communications, where multiple PAs are activated per segment, and PA placement methods are proposed to maximize the downlink received SNR under the three protocols. Numerical results demonstrate that: i) among the three protocols, SM achieves the best performance, followed by SA and then SS; and ii) for all protocols, the proposed SWAN achieves a higher SNR than conventional PASS with a single long waveguide in both uplink and downlink scenarios.
Submitted 12 September, 2025;
originally announced September 2025.
-
Low-Complexity Null-Space-Based Simultaneous Wireless Information and Power Transfer Scheme
Authors:
Cheng Luo,
Jie Hu,
Luping Xiang,
Kun Yang,
Zhiqin Wang
Abstract:
Simultaneous wireless information and power transfer (SWIPT) has attracted sustained interest. We propose a null-space-based transmission scheme for multiuser SWIPT serving both energy users (EUs) and information users (IUs). Under a practical nonlinear energy-harvesting (EH) model and multiple waveform options, we revisit the role of dedicated energy beams (EBs). We show that, in general, dedicated EBs are unnecessary because information beams (IBs) with Gaussian signaling can simultaneously support wireless energy transfer (WET) and wireless information transfer (WIT), unless special energy-centric waveforms (e.g., deterministic sinusoidal waveforms) are employed and provide sufficient gains. Guided by these insights, we formulate an optimization problem for EB design to enable dedicated waveform transmission for WET, and we develop a low-complexity algorithm that reduces computation by ignoring the WET contribution of IBs during optimization. Numerical results corroborate that deterministic sinusoidal waveforms outperform Gaussian signaling when the received RF power lies in the EH high-efficiency region, making dedicated EBs beneficial. The proposed scheme achieves computational complexity reductions of 91.43% and 98.54% for the cases $M=8,\,K^I=K^E=2$ and $M=16,\,K^I=K^E=4$, respectively, with negligible performance loss, thereby validating the efficiency of the low-complexity algorithm.
Submitted 12 September, 2025;
originally announced September 2025.
-
First-Principle Modeling Framework of Boost Converter Dynamics for Precise Energy Conversions in Space
Authors:
Yifan Wang,
Wenhua Li,
Zhenlong Wang,
Xinrui Zhang,
Jianfeng Sun,
Qianfu Xia,
Zhongtao Gou,
Jiangang Rong,
Tao Ye
Abstract:
Boost converters are essential for modern electrification and intelligent technologies. However, conventional Boost converter models relying on steady-state assumptions fail to accurately predict transient behaviors during input voltage and load fluctuations, which cause significant output voltage overshoots and instability, resulting in failures of electrical systems, thereby restricting their use in space. This study introduces a first-principle modeling framework that derives precise dynamic equations for Boost converters by incorporating non-ideal component coupling. Compared to the most accurate existing Boost converter model, the proposed models reduce steady-state and dynamic-state errors between experimental and simulated output voltages by factors of 11.0 (from 20.9% to 1.9%) and 15.4 (from 77.1% to 5.0%) under input voltage variations, and by factors of 10.2 (from 15.3% to 1.5%) and 35.1 (from 42.1% to 1.2%) under load changes, respectively. Consequently, a reliable Boost converter is designed and deployed on orbit for precise energy conversions.
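The reduction factors quoted in this abstract can be checked directly from the before/after error percentages it reports. The short Python sketch below is only an arithmetic check: the function name `reduction_factor` and the dictionary labels are illustrative choices, not taken from the paper, and the factor is assumed to be the plain ratio of the old error to the new error.

```python
# Arithmetic check of the error-reduction factors quoted in the abstract.
# Assumption: "factor" means (error of existing model) / (error of proposed model).
# The labels and the helper name are hypothetical, chosen for illustration.

def reduction_factor(before_pct: float, after_pct: float) -> float:
    """Ratio of the previous error to the new error (both in percent)."""
    return before_pct / after_pct

# (before %, after %, factor stated in the abstract)
reported = {
    "steady-state, input voltage variation": (20.9, 1.9, 11.0),
    "dynamic-state, input voltage variation": (77.1, 5.0, 15.4),
    "steady-state, load change": (15.3, 1.5, 10.2),
    "dynamic-state, load change": (42.1, 1.2, 35.1),
}

for label, (before, after, stated) in reported.items():
    factor = reduction_factor(before, after)
    # Compare at the one-decimal precision used in the abstract.
    matches = round(factor, 1) == stated
    print(f"{label}: computed {factor:.2f}, stated {stated}, consistent={matches}")
```

Under that assumption, each stated factor agrees with the reported percentages at one-decimal precision.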
Submitted 8 September, 2025;
originally announced September 2025.
-
VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results
Authors:
Yixiao Li,
Xin Li,
Chris Wei Zhou,
Shuo Xing,
Hadi Amirpour,
Xiaoshuai Hao,
Guanghui Yue,
Baoquan Zhao,
Weide Liu,
Xiaoyuan Yang,
Zhengzhong Tu,
Xinyu Li,
Chuanbiao Song,
Chenqi Zhang,
Jun Lan,
Huijia Zhu,
Weiqiang Wang,
Xiaoyan Sun,
Shishun Tian,
Dongyang Yan,
Weixia Zhang,
Junlin Chen,
Wei Sun,
Zhihua Wang,
Zhuohang Shi
, et al. (6 additional authors not shown)
Abstract:
This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: https://github.com/Lighting-YXLI/ISRGen-QA.
Submitted 8 September, 2025;
originally announced September 2025.
-
Pinching Antenna System (PASS) Enhanced Covert Communications: Against Warden via Sensing
Authors:
Hao Jiang,
Zhaolin Wang,
Yuanwei Liu,
Arumugam Nallanathan,
Zhiguo Ding
Abstract:
A sensing-aided covert communication network empowered by pinching antenna systems (PASS) is proposed in this work. Unlike conventional fixed-position MIMO arrays, PASS dynamically reconfigures its pinching antennas (PAs) closer to the legitimate user, substantially enhancing covertness. To further secure the adversary's channel state information (CSI), a sensing function is leveraged to track the malicious warden's movements. In particular, this paper first proposes an extended Kalman filter (EKF) based approach to fulfilling the tracking function. Building on this, a covert communication problem is formulated with a joint design of beamforming, artificial noise (AN) signals, and the position of PAs. Then, the beamforming and AN design subproblems are resolved jointly with a subspace approach, while the PA position optimization subproblem is handled by a deep reinforcement learning (DRL) approach by treating the evolution of the warden's mobility status as a temporally correlated process. Numerical results demonstrate that: i) the EKF approach can accurately track the warden's CSI with low complexity, ii) the proposed solution outperforms the greedy and searching-based benchmarks, and iii) with new design degrees of freedom (DoFs), the performance of PASS is superior to that of conventional fully-digital MIMO systems.
Submitted 7 September, 2025;
originally announced September 2025.
-
DeepStream: Prototyping Deep Joint Source-Channel Coding for Real-Time Multimedia Transmissions
Authors:
Kaiyi Chi,
Yinghui He,
Qianqian Yang,
Zhiping Jiang,
Yuanchao Shu,
Zhiqin Wang,
Jun Luo,
Jiming Chen
Abstract:
Deep learning-based joint source-channel coding (DeepJSCC) has emerged as a promising technique in 6G for enhancing the efficiency and reliability of data transmission across diverse modalities, particularly in low signal-to-noise ratio (SNR) environments. This advantage is realized by leveraging powerful neural networks to learn an optimal end-to-end mapping from the source data directly to the transmit symbol sequence, eliminating the need for separate source coding, channel coding, and modulation. Although numerous efforts have been made towards efficient DeepJSCC, they have largely stayed at numerical simulations that can be far from practice, leaving the real-world viability of DeepJSCC largely unverified. To this end, we prototype DeepStream upon orthogonal frequency division multiplexing (OFDM) technology to offer efficient and robust DeepJSCC for multimedia transmission. In conforming to OFDM, we develop both a feature-to-symbol mapping method and a cross-subcarrier precoding method to improve the subcarrier independence and reduce peak-to-average power ratio. To reduce system complexity and enable flexibility in accommodating varying quality of service requirements, we further propose a progressive coding strategy that adjusts the compression ratio based on latency with minimal performance loss. We implement DeepStream for real-time image transmission and video streaming using software-defined radio. Extensive evaluations verify that DeepStream outperforms both the standard scheme and the direct deployment scheme. Particularly, at an SNR of 10 dB, DeepStream achieves a PSNR of 35 dB for image transmission and an MS-SSIM of 20 dB for video streaming, whereas the standard scheme fails to recover meaningful information.
Submitted 7 September, 2025;
originally announced September 2025.
-
Multi-modal Uncertainty Robust Tree Cover Segmentation For High-Resolution Remote Sensing Images
Authors:
Yuanyuan Gui,
Wei Li,
Yinjian Wang,
Xiang-Gen Xia,
Mauro Marty,
Christian Ginzler,
Zuyuan Wang
Abstract:
Recent advances in semantic segmentation of multi-modal remote sensing images have significantly improved the accuracy of tree cover mapping, supporting applications in urban planning, forest monitoring, and ecological assessment. Integrating data from multiple modalities, such as optical imagery, light detection and ranging (LiDAR), and synthetic aperture radar (SAR), has shown superior performance over single-modality methods. However, these data are often acquired days or even months apart, during which various changes may occur, such as vegetation disturbances (e.g., logging and wildfires) and variations in imaging quality. Such temporal misalignments introduce cross-modal uncertainty, especially in high-resolution imagery, which can severely degrade segmentation accuracy. To address this challenge, we propose MURTreeFormer, a novel multi-modal segmentation framework that mitigates and leverages aleatoric uncertainty for robust tree cover mapping. MURTreeFormer treats one modality as primary and others as auxiliary, explicitly modeling patch-level uncertainty in the auxiliary modalities via a probabilistic latent representation. Uncertain patches are identified and reconstructed from the primary modality's distribution through a VAE-based resampling mechanism, producing enhanced auxiliary features for fusion. In the decoder, a gradient magnitude attention (GMA) module and a lightweight refinement head (RH) are further integrated to guide attention toward tree-like structures and to preserve fine-grained spatial details. Extensive experiments on multi-modal datasets from Shanghai and Zurich demonstrate that MURTreeFormer significantly improves segmentation performance and effectively reduces the impact of temporally induced aleatoric uncertainty.
Submitted 5 September, 2025;
originally announced September 2025.
-
Generalist versus Specialist Vision Foundation Models for Ocular Disease and Oculomics
Authors:
Yukun Zhou,
Paul Nderitu,
Jocelyn Hui Lin Goh,
Justin Engelmann,
Siegfried K. Wagner,
Anran Ran,
Hongyang Jiang,
Lie Ju,
Ke Zou,
Sahana Srinivasan,
Hyunmin Kim,
Takahiro Ninomiya,
Zheyuan Wang,
Gabriel Dawei Yang,
Eden Ruffell,
Dominic Williamson,
Rui Santos,
Gabor Mark Somfai,
Carol Y. Cheung,
Tien Yin Wong,
Daniel C. Alexander,
Yih Chung Tham,
Pearse A. Keane
Abstract:
Medical foundation models, pre-trained with large-scale clinical data, demonstrate strong performance in diverse clinically relevant applications. RETFound, trained on nearly one million retinal images, exemplifies this approach in applications with retinal images. However, the emergence of increasingly powerful and multifold larger generalist foundation models such as DINOv2 and DINOv3 raises the question of whether domain-specific pre-training remains essential, and if so, what gap persists. To investigate this, we systematically evaluated the adaptability of DINOv2 and DINOv3 in retinal image applications, compared to two specialist RETFound models, RETFound-MAE and RETFound-DINOv2. We assessed performance on ocular disease detection and systemic disease prediction using two adaptation strategies: fine-tuning and linear probing. Data efficiency and adaptation efficiency were further analysed to characterise trade-offs between predictive performance and computational cost. Our results show that although scaling generalist models yields strong adaptability across diverse tasks, RETFound-DINOv2 consistently outperforms these generalist foundation models in ocular-disease detection and oculomics tasks, demonstrating stronger generalisability and data efficiency. These findings suggest that specialist retinal foundation models remain the most effective choice for clinical applications, while the narrowing gap with generalist foundation models suggests that continued data and model scaling can deliver domain-relevant gains and position them as strong foundations for future medical foundation models.
Submitted 3 September, 2025;
originally announced September 2025.
-
autoPET IV challenge: Incorporating organ supervision and human guidance for lesion segmentation in PET/CT
Authors:
Junwei Huang,
Yingqi Hao,
Yitong Luo,
Ziyu Wang,
Mingxuan Liu,
Yifei Chen,
Yuanhan Wang,
Lei Xiang,
Qiyuan Tian
Abstract:
Lesion Segmentation in PET/CT scans is an essential part of modern oncological workflows. To address the challenges of time-intensive manual annotation and high inter-observer variability, the autoPET challenge series seeks to advance automated segmentation methods in complex multi-tracer and multi-center settings. Building on this foundation, autoPET IV introduces a human-in-the-loop scenario to efficiently utilize interactive human guidance in segmentation tasks. In this work, we incorporated tracer classification, organ supervision and simulated clicks guidance into the nnUNet Residual Encoder framework, forming an integrated pipeline that demonstrates robust performance in a fully automated (zero-guidance) context and efficiently leverages iterative interactions to progressively enhance segmentation accuracy.
Submitted 2 September, 2025;
originally announced September 2025.
-
Affine-Doppler Division Multiplexing for High-Mobility Wireless Communications Systems
Authors:
Yuanfang Ma,
Zulin Wang,
Peng Yuan,
Qin Huang,
Yuanhan Ni
Abstract:
Affine Frequency Division Multiplexing (AFDM) has been regarded as a candidate integrated sensing and communications (ISAC) waveform owing to its superior communication performance, outperforming Orthogonal Time-Frequency Space (OTFS), which has been researched for a longer time. However, since the two waveforms are incompatible with each other, state-of-the-art methods designed for OTFS may not be directly applicable to AFDM. This paper introduces a new orthogonal multicarrier waveform, namely Affine-Doppler Division Multiplexing (ADDM), which provides a generic framework and subsumes the existing OTFS and AFDM as particular cases. By modulating information symbols in the Affine-Doppler (A-D) domain based on a two-dimensional (2D) transform, ADDM enjoys both excellent unambiguous Doppler and Doppler resolution, matching AFDM and outperforming OTFS. Moreover, benefiting from the 2D transform, the symbol block of ADDM in the A-D domain undergoes a 2D cyclic shift produced by the delay and the Doppler of the channel, similar to the 2D cyclic shift in the delay-Doppler domain of cyclic prefix (CP)-OTFS. This offers the potential to directly apply state-of-the-art methods designed for OTFS and AFDM to ADDM. Numerical results show that ADDM achieves BER performance comparable to AFDM and outperforms OTFS in high-mobility scenarios.
Submitted 4 September, 2025; v1 submitted 2 September, 2025;
originally announced September 2025.
-
Efficient River Water Level Sensing Using Cellular CSI and Joint Space-Time Processing
Authors:
Khawaja Fahad Masood,
Kai Wu,
Zhongqin Wang,
J. Andrew Zhang,
Shu-Lin Chen,
Y. Jay Guo
Abstract:
Accurate and timely water level monitoring is critical for flood prevention, environmental management, and emerging smart infrastructure systems. Traditional water sensing methods often rely on dedicated sensors, which can be costly to deploy, difficult to maintain, and vulnerable to damage during floods. In this work, we propose a novel cellular signal-based sensing scheme that passively estimates water level changes using downlink mobile signals from existing communication infrastructure. By capturing subtle variations in channel state information (CSI), the proposed method estimates the length changes of the water-reflected signal path, which correspond to water level variations. A space-time processing framework is developed to jointly estimate the angle of arrival and Doppler shift, enabling isolation and enhancement of the water-reflected path via beamforming, while effectively suppressing environmental noise. The phase evolution of the beamformed signal is then extracted to infer water level changes. To address clock asynchronism between the transmitter and receiver inherent in bistatic systems, we introduce a beamforming-based compensation technique for removing time-varying random phase offsets in CSI. Field experiments conducted across a river demonstrate that the proposed method enables accurate and reliable water level estimation, achieving a mean accuracy ranging from 1.5 cm to 3.05 cm across different receiver configurations and deployments.
Submitted 1 September, 2025;
originally announced September 2025.
-
Learn2Reg 2024: New Benchmark Datasets Driving Progress on New Challenges
Authors:
Lasse Hansen,
Wiebke Heyer,
Christoph Großbröhmer,
Frederic Madesta,
Thilo Sentker,
Wang Jiazheng,
Yuxi Zhang,
Hang Zhang,
Min Liu,
Junyi Wang,
Xi Zhu,
Yuhua Li,
Liwen Wang,
Daniil Morozov,
Nazim Haouchine,
Joel Honkamaa,
Pekka Marttinen,
Yichao Zhou,
Zuopeng Tan,
Zhuoyuan Wang,
Yi Wang,
Hongchao Zhou,
Shunbo Hu,
Yi Zhang,
Qian Tao
, et al. (29 additional authors not shown)
Abstract:
Medical image registration is critical for clinical applications, and fair benchmarking of different methods is essential for monitoring ongoing progress. To date, the Learn2Reg 2020-2023 challenges have released several complementary datasets and established metrics for evaluations. However, these editions did not capture all aspects of the registration problem, particularly in terms of modality diversity and task complexity. To address these limitations, the 2024 edition introduces three new tasks, including large-scale multi-modal registration and unsupervised inter-subject brain registration, as well as the first microscopy-focused benchmark within Learn2Reg. The new datasets also inspired new method developments, including invertibility constraints, pyramid features, keypoints alignment and instance optimisation.
Submitted 8 September, 2025; v1 submitted 1 September, 2025;
originally announced September 2025.
-
Doubly-Dispersive Continuous MIMO Systems: Channel Modeling and Beamforming Design
Authors:
Kuranage Roche Rayan Ranasinghe,
Zhaolin Wang,
Hyeon Seok Rou,
Giuseppe Thadeu Freitas de Abreu,
Emil Björnson
Abstract:
We address the modeling and optimal beamforming (BF) design for multiple-input multiple-output (MIMO) continuous aperture array (CAPA) systems operating over doubly-dispersive (DD) channels. First, a comprehensive DD continuous MIMO (DDC MIMO) channel model that incorporates CAPAs at both the transmitter (TX) and receiver (RX) is derived, which is used to obtain explicit input-output (I/O) relations for various waveforms well suited to integrated sensing and communications (ISAC) and robust to DD channels, namely orthogonal frequency division multiplexing (OFDM), orthogonal time frequency space (OTFS), and affine frequency division multiplexing (AFDM). Then, functional optimization problems are formulated for the design of TX and RX BF matrices that maximize received power, in which novel low-complexity, closed-form solutions are obtained via the calculus of variations (CoV) method, yielding expressions closely related to the classical matched filter commonly used in conventional MIMO systems. Simulation results confirm that the proposed TX/RX BF designs with CAPAs provide significant performance and computational complexity gains over conventional MIMO systems in DD channels.
Submitted 4 September, 2025; v1 submitted 31 August, 2025;
originally announced September 2025.
-
CoMET: A Contrastive-Masked Brain Foundation Model for Universal EEG Representation
Authors:
Ang Li,
Zikai Wang,
Liuyin Yang,
Zhenyu Wang,
Tianheng Xu,
Honglin Hu,
Marc M. Van Hulle
Abstract:
Electroencephalography (EEG) is a non-invasive technique for recording brain activity, widely used in brain-computer interfaces, clinical practice, and healthcare. Traditional EEG deep models typically focus on a specific dataset and task, limiting model size and generalization. Recently, self-supervised brain foundation models have emerged and been applied to various downstream tasks. Nevertheless, these models still have limitations: current SOTA models typically rely on a masked reconstruction strategy; however, EEG features of adjacent channels are highly correlated, which causes the pre-training to overly focus on low-dimensional signal-similarity features in local regions and neglect the global discriminative patterns vital for downstream tasks. To address these limitations, we propose a brain foundation model called CoMET. Specifically, we employ the masked autoencoder with redesigned patching and embedding for EEG as the backbone and devise a novel contrastive learning framework with mirror-scale augmentation to strengthen the global discrimination ability. CoMET is pre-trained on mixed EEG datasets covering over 3000 subjects with over one million samples. It is evaluated on ten different downstream datasets, and the SOTA results demonstrate CoMET's superior ability in extracting universal EEG representations and its strong clinical potential.
Submitted 29 August, 2025;
originally announced September 2025.
-
Neural Spline Operators for Risk Quantification in Stochastic Systems
Authors:
Zhuoyuan Wang,
Raffaele Romagnoli,
Kamyar Azizzadenesheli,
Yorie Nakahira
Abstract:
Accurately quantifying long-term risk probabilities in diverse stochastic systems is essential for safety-critical control. However, existing sampling-based and partial differential equation (PDE)-based methods often struggle to handle complex varying dynamics. Physics-informed neural networks learn surrogate mappings for risk probabilities from varying system parameters of fixed and finite dimensions, yet cannot account for functional variations in system dynamics. To address these challenges, we introduce physics-informed neural operator (PINO) methods to risk quantification problems, to learn mappings from varying functional system dynamics to corresponding risk probabilities. Specifically, we propose Neural Spline Operators (NeSO), a PINO framework that leverages B-spline representations to improve training efficiency and achieve better enforcement of initial and boundary conditions, which is crucial for accurate risk quantification. We provide theoretical analysis demonstrating the universal approximation capability of NeSO. We also present two case studies, one with varying functional dynamics and another with high-dimensional multi-agent dynamics, to demonstrate the efficacy of NeSO and its significant online speed-up over existing methods. The proposed framework and the accompanying universal approximation theorem are expected to be beneficial for other control or PDE-related problems beyond risk quantification.
Submitted 27 August, 2025;
originally announced August 2025.
-
UltraEar: a multicentric, large-scale database combining ultra-high-resolution computed tomography and clinical data for ear diseases
Authors:
Ruowei Tang,
Pengfei Zhao,
Xiaoguang Li,
Ning Xu,
Yue Cheng,
Mengshi Zhang,
Zhixiang Wang,
Zhengyu Zhang,
Hongxia Yin,
Heyu Ding,
Shusheng Gong,
Yuhe Liu,
Zhenchang Wang
Abstract:
Ear diseases affect billions of people worldwide, leading to substantial health and socioeconomic burdens. Computed tomography (CT) plays a pivotal role in accurate diagnosis, treatment planning, and outcome evaluation. The objective of this study is to present the establishment and design of UltraEar Database, a large-scale, multicentric repository of isotropic 0.1 mm ultra-high-resolution CT (U-HRCT) images and associated clinical data dedicated to ear diseases. UltraEar recruits patients from 11 tertiary hospitals between October 2020 and October 2035, integrating U-HRCT images, structured CT reports, and comprehensive clinical information, including demographics, audiometric profiles, surgical records, and pathological findings. A broad spectrum of otologic disorders is covered, such as otitis media, cholesteatoma, ossicular chain malformation, temporal bone fracture, inner ear malformation, cochlear aperture stenosis, enlarged vestibular aqueduct, and sigmoid sinus bony deficiency. Standardized preprocessing pipelines have been developed for geometric calibration, image annotation, and multi-structure segmentation. All personal identifiers in DICOM headers and metadata are removed or anonymized to ensure compliance with data privacy regulation. Data collection and curation are coordinated through monthly expert panel meetings, with secure storage on an offline cloud system. UltraEar provides an unprecedented ultra-high-resolution reference atlas with both technical fidelity and clinical relevance. This resource has significant potential to advance radiological research, enable development and validation of AI algorithms, serve as an educational tool for training in otologic imaging, and support multi-institutional collaborative studies. UltraEar will be continuously updated and expanded, ensuring long-term accessibility and usability for the global otologic research community.
Submitted 27 August, 2025;
originally announced August 2025.
-
Low-Cost Architecture and Efficient Pattern Synthesis for Polarimetric Phased Array Based on Polarization Coding Reconfigurable Elements
Authors:
Yiqing Wang,
Jian Zhou,
Chen Pang,
Wenyang Man,
Zixiang Xiong,
Ke Meng,
Zhanling Wang,
Yongzhen Li
Abstract:
Polarimetric phased arrays (PPAs) enhance radar target detection and anti-jamming capabilities. However, the dual transmit/receive (T/R) channel requirement leads to high costs and system complexity. To address this, this paper introduces a polarization-coding reconfigurable phased array (PCRPA) and associated pattern synthesis techniques to reduce PPA costs while minimizing performance degradation. Each PCRPA element connects to a single T/R channel and incorporates two-level RF switches for real-time control of polarization states and waveforms. By adjusting element codes and excitation weights, the PCRPA can generate arbitrarily polarized and dual-polarized beams. Efficient beam pattern synthesis methods are also proposed, featuring novel optimization constraints derived from theoretical and analytical analysis of PCRPAs. Simulations demonstrate that the approach achieves low cross-polarization and sidelobe levels comparable to conventional architectures within the scan range, particularly for large arrays. However, the channel reduction inevitably incurs power and directivity loss. Experiments conducted on an $8\times 8$ X-band array antenna validate the effectiveness of the proposed system. The PCRPA and synthesis methods are well-suited for large-scale PPA systems, offering significant cost-effectiveness while maintaining good sidelobe suppression and polarization control performance.
Submitted 28 August, 2025; v1 submitted 27 August, 2025;
originally announced August 2025.
-
VibeVoice Technical Report
Authors:
Zhiliang Peng,
Jianwei Yu,
Wenhui Wang,
Yaoyao Chang,
Yutao Sun,
Li Dong,
Yi Zhu,
Weijiang Xu,
Hangbo Bao,
Zehua Wang,
Shaohan Huang,
Yan Xia,
Furu Wei
Abstract:
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational ``vibe'' and surpassing open-source and proprietary dialogue models.
Submitted 26 August, 2025;
originally announced August 2025.
-
The Sound of Risk: A Multimodal Physics-Informed Acoustic Model for Forecasting Market Volatility and Enhancing Market Interpretability
Authors:
Xiaoliang Chen,
Xin Yu,
Le Chang,
Teng Jing,
Jiashuai He,
Ze Wang,
Yangjun Luo,
Xingyu Chen,
Jiayue Liang,
Yuchen Wang,
Jiaying Xie
Abstract:
Information asymmetry in financial markets, often amplified by strategically crafted corporate narratives, undermines the effectiveness of conventional textual analysis. We propose a novel multimodal framework for financial risk assessment that integrates textual sentiment with paralinguistic cues derived from executive vocal tract dynamics in earnings calls. Central to this framework is the Physics-Informed Acoustic Model (PIAM), which applies nonlinear acoustics to robustly extract emotional signatures from raw teleconference sound subject to distortions such as signal clipping. Both acoustic and textual emotional states are projected onto an interpretable three-dimensional Affective State Label (ASL) space: Tension, Stability, and Arousal. Using a dataset of 1,795 earnings calls (approximately 1,800 hours), we construct features capturing dynamic shifts in executive affect between scripted presentation and spontaneous Q&A exchanges. Our key finding reveals a pronounced divergence in predictive capacity: while multimodal features do not forecast directional stock returns, they explain up to 43.8% of the out-of-sample variance in 30-day realized volatility. Importantly, volatility predictions are strongly driven by emotional dynamics during executive transitions from scripted to spontaneous speech, particularly reduced textual stability and heightened acoustic instability from CFOs, and significant arousal variability from CEOs. An ablation study confirms that our multimodal approach substantially outperforms a financials-only baseline, underscoring the complementary contributions of acoustic and textual modalities. By decoding latent markers of uncertainty from verifiable biometric signals, our methodology provides investors and regulators a powerful tool for enhancing market interpretability and identifying hidden corporate uncertainty.
Submitted 25 August, 2025;
originally announced August 2025.
-
H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems
Authors:
Huangyu Dai,
Lingtao Mao,
Ben Chen,
Zihan Wang,
Zihan Liang,
Ying Han,
Chenyi Lei,
Han Li
Abstract:
Hotword customization is crucial in ASR to enhance the accuracy of domain-specific terms. It has been primarily driven by advancements in traditional models and audio large language models (LLMs). However, existing models often struggle with large-scale hotwords, as the recognition rate drops dramatically as the number of hotwords increases. In this paper, we introduce a novel hotword customization system that utilizes a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidates by measuring the acoustic similarity between the hotwords and the speech segment. This plug-and-play solution can be easily integrated into traditional models such as SeACo-Paraformer, significantly enhancing the hotword post-recall rate (PRR). Additionally, we incorporate H-PRM into audio LLMs through a prompt-based approach, enabling seamless customization of hotwords. Extensive testing validates that H-PRM can outperform existing methods, showing a new direction for hotword customization in ASR.
Submitted 22 August, 2025;
originally announced August 2025.
-
A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer
Authors:
Yuhui Tao,
Zhongwei Zhao,
Zilong Wang,
Xufang Luo,
Feng Chen,
Kang Wang,
Chuanfu Wu,
Xue Zhang,
Shaoting Zhang,
Jiaxi Yao,
Xingwei Jin,
Xinyang Jiang,
Yifan Yang,
Dongsheng Li,
Lili Qiu,
Zhiqiang Shao,
Jianming Guo,
Nengwang Yu,
Shuo Wang,
Ying Xiong
Abstract:
The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP, a vision-language foundation model for the characterization, diagnosis, and prognosis of renal masses, using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge before aligning them through a contrastive learning objective, to create robust representations for superior generalization and diagnostic precision. RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction, compared with other state-of-the-art general-purpose CT foundation models. In particular, for complicated tasks such as recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, representing a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP's pre-training imparted remarkable data efficiency; in the diagnostic classification task, it needs only 20% of the training data to achieve the peak performance of all baseline models, even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval, and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.
Submitted 22 August, 2025;
originally announced August 2025.
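
The C-index quoted in the RenalCLIP abstract above is a ranking metric for survival models. As a rough illustration only, the sketch below computes Harrell's concordance index on toy data. The function name, the toy inputs, and the choice to skip pairs with tied follow-up times are assumptions made here for illustration; the abstract does not say which C-index variant or tooling the authors used.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times:       observed follow-up times
    events:      1 if the event (e.g. recurrence) was observed, 0 if censored
    risk_scores: model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the subject with the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the earlier time is an observed event.
        # Pairs with tied times are skipped in this simplified sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.726 should be read.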
-
AIM 2025 challenge on Inverse Tone Mapping Report: Methods and Results
Authors:
Chao Wang,
Francesco Banterle,
Bin Ren,
Radu Timofte,
Xin Lu,
Yufeng Peng,
Chengjie Ge,
Zhijing Sun,
Ziang Zhou,
Zihao Li,
Zishun Liao,
Qiyu Kang,
Xueyang Fu,
Zheng-Jun Zha,
Zhijing Sun,
Xingbo Wang,
Kean Liu,
Senyan Xu,
Yang Qiu,
Yifan Ding,
Gabriel Eilertsen,
Jonas Unger,
Zihao Wang,
Ke Wu,
Jinshan Pan
, et al. (4 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the AIM 2025 Challenge on Inverse Tone Mapping (ITM). The challenge aimed to push forward the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs, focusing on perceptual fidelity and numerical consistency. A total of 67 participants submitted 319 valid results, from which the best five teams were selected for detailed analysis. This report consolidates their methodologies and performance, with the lowest PU21-PSNR among the top entries reaching 29.22 dB. The analysis highlights innovative strategies for enhancing HDR reconstruction quality and establishes strong benchmarks to guide future research in inverse tone mapping.
Submitted 21 September, 2025; v1 submitted 18 August, 2025;
originally announced August 2025.
-
Stochastic Black Start Resource Allocation to Enable Dynamic Formation of Networked Microgrids and DER-aided Restoration
Authors:
Cong Bai,
Salish Maharjan,
Han Wang,
Zhaoyu Wang
Abstract:
Extended outages in distribution systems (DSs) dominated by distributed energy resources (DERs) require innovative strategies to efficiently and securely deploy black start (BS) resources. To address this need, this paper proposes a two-stage stochastic resource allocation method within synchronizing dynamic microgrids (MGs) for black start (SDMG-BS), enabling risk-averse and adaptive restoration across various scenarios while ensuring frequency security. Virtual synchronous generator (VSG)-controlled grid-forming inverters (GFMIs) equipped with primary frequency governors (PFGs) are modeled as BS resources. Their frequency response is characterized by three transient indices, which are deployed as frequency dynamic constraints on load pick-up events to ensure frequency stability during the BS process. The SDMG-BS framework facilitates location-independent synchronization among restored MGs and with the transmission grid (TG) with the help of smart switches (SSWs). The model incorporates scenario-based stochastic programming to address multi-source uncertainties, including season-dependent operational conditions and unpredictable TG outage durations, ensuring a resilient allocation plan. The proposed approach is validated on a modified IEEE 123-node feeder with three study cases designed across sixteen uncertainty scenarios.
Submitted 18 August, 2025;
originally announced August 2025.
-
Exploiting Convexity of Neural Networks in Dynamic Operating Envelope Optimization for Distributed Energy Resources
Authors:
Hongyi Li,
Liming Liu,
Yunyi Li,
Zhaoyu Wang
Abstract:
The increasing penetration of distributed energy resources (DERs) brings opportunities and challenges to the operation of distribution systems. To ensure network integrity, dynamic operating envelopes (DOEs) are issued by utilities to DERs as their time-varying export/import power limits. Due to the non-convex nature of power flow equations, the optimization of DOEs faces a dilemma of solution accuracy and computation efficiency. To bridge this gap, in this paper, we facilitate DOE optimization by exploiting the convexity of input convex neural networks (ICNNs). A DOE optimization model is first presented, comprehensively considering multiple operational constraints. We propose a constraint embedding method that allows us to replace the non-convex power flow constraints with trained ICNN models and convexify the problem. To further speed up DOE optimization, we propose a linear relaxation of the ICNN-based DOE optimization problem, for which the tightness is theoretically proven. The effectiveness of the proposed method is validated with numerical case studies. Results show that the proposed ICNN-based method outperforms other benchmark methods in optimizing DOEs in terms of both solution quality and solution time.
Submitted 18 August, 2025;
originally announced August 2025.
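
The abstract above relies on the fact that an input convex neural network (ICNN) is convex in its input by construction. The following is a minimal sketch of that property under illustrative assumptions: a single hidden layer, random weights, and the helper names `make_icnn` and `icnn_forward` are all hypothetical and are not taken from the paper. The two ingredients shown are the usual ones: non-negative weights on the hidden-to-output path, and a convex, non-decreasing activation (ReLU).

```python
import random


def relu(vec):
    return [max(0.0, a) for a in vec]


def matvec(matrix, vec):
    return [sum(w * a for w, a in zip(row, vec)) for row in matrix]


def add(*vecs):
    return [sum(column) for column in zip(*vecs)]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def make_icnn(n_in, n_hidden, seed=0):
    """Build illustrative ICNN parameters with a scalar output."""
    rng = random.Random(seed)
    return {
        "Wx0": random_matrix(rng, n_hidden, n_in),   # input -> hidden (any sign)
        "b0": [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)],
        # hidden -> output weights are forced non-negative to preserve convexity
        "Wz1": [[abs(w) for w in row] for row in random_matrix(rng, 1, n_hidden)],
        "Wx1": random_matrix(rng, 1, n_in),          # input skip connection (any sign)
        "b1": [rng.uniform(-1.0, 1.0)],
    }


def icnn_forward(params, x):
    """Scalar output that is convex in x."""
    z1 = relu(add(matvec(params["Wx0"], x), params["b0"]))
    out = add(matvec(params["Wz1"], z1), matvec(params["Wx1"], x), params["b1"])
    return out[0]
```

Because each hidden unit is a ReLU of an affine function (convex), and the output is a non-negative combination of those units plus an affine term, the output is convex in `x`. That is what lets a trained ICNN stand in for a non-convex constraint inside a convex program.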
-
Grid Edge Intelligence-Assisted Model Predictive Framework for Black Start of Distribution Systems with Inverter-Based Resources
Authors:
Junyuan Zheng,
Salish Maharjan,
Zhaoyu Wang
Abstract:
The growing proliferation of distributed energy resources (DERs) is significantly enhancing the resilience and reliability of distribution systems. However, a substantial portion of behind-the-meter (BTM) DERs is often overlooked during black start (BS) and restoration processes. Existing BS strategies that utilize grid-forming (GFM) battery energy storage systems (BESS) frequently ignore critical frequency security and synchronization constraints. To address these limitations, this paper proposes a predictive framework for bottom-up BS that leverages the flexibility of BTM DERs through Grid Edge Intelligence (GEI). A predictive model is developed for GEI to estimate multi-period flexibility ranges and track dispatch signals from the utility. A frequency-constrained BS strategy is then introduced, explicitly incorporating constraints on frequency nadir, rate-of-change-of-frequency (RoCoF), and quasi-steady-state (QSS) frequency. The framework also includes synchronizing switches to enable faster and more secure load restoration. Notably, it requires GEI devices to communicate only their flexibility ranges and the utility to send dispatch signals without exchanging detailed asset information. The proposed framework is validated using a modified IEEE 123-bus test system, and the impact of GEI is demonstrated by comparing results across various GEI penetration scenarios.
Submitted 18 August, 2025;
originally announced August 2025.
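
The three frequency quantities named in the abstract above (nadir, RoCoF, and quasi-steady-state frequency) can be read off a sampled frequency trace. The sketch below is only an illustration of those definitions on measured samples. The function name, the finite-difference RoCoF estimate, and the "average of the last few samples" rule for the QSS value are assumptions made here; the paper itself embeds these quantities as optimization constraints, which this snippet does not attempt.

```python
def frequency_indices(t, f, settle_window=5):
    """Return (nadir, max |RoCoF|, quasi-steady-state frequency).

    t: sample times in seconds (strictly increasing)
    f: frequency samples in Hz
    settle_window: number of trailing samples averaged for the QSS estimate
    """
    if len(t) != len(f) or len(t) < 2:
        raise ValueError("need at least two matching samples")
    # Nadir: lowest frequency reached after the load pick-up.
    nadir = min(f)
    # RoCoF: largest absolute finite-difference slope between samples (Hz/s).
    rocof = max(
        abs((f[k + 1] - f[k]) / (t[k + 1] - t[k])) for k in range(len(t) - 1)
    )
    # QSS frequency: mean of the trailing samples, assumed to have settled.
    tail = f[-settle_window:]
    qss = sum(tail) / len(tail)
    return nadir, rocof, qss
```

For example, a trace that dips from 60.0 Hz to 59.4 Hz and settles at 59.7 Hz yields a nadir of 59.4 Hz and a QSS frequency of about 59.7 Hz.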
-
LLM-RIMSA: Large Language Models driven Reconfigurable Intelligent Metasurface Antenna Systems
Authors:
Yunsong Huang,
Hui-Ming Wang,
Qingli Yan,
Zhaowei Wang
Abstract:
The evolution of 6G networks demands ultra-massive connectivity and intelligent radio environments, yet existing reconfigurable intelligent surface (RIS) technologies face critical limitations in hardware efficiency, dynamic control, and scalability. This paper introduces LLM-RIMSA, a transformative framework that integrates large language models (LLMs) with a novel reconfigurable intelligent metasurface antenna (RIMSA) architecture to address these challenges. Unlike conventional RIS designs, RIMSA employs parallel coaxial feeding and 2D metasurface integration, enabling each individual metamaterial element to independently adjust both its amplitude and phase. While traditional optimization and deep learning (DL) methods struggle with high-dimensional state spaces and prohibitive training costs for RIMSA control, LLM-RIMSA leverages pre-trained LLMs' cross-modal reasoning and few-shot learning capabilities to dynamically optimize RIMSA configurations. Simulations demonstrate that LLM-RIMSA achieves state-of-the-art performance, outperforming conventional DL-based methods in sum rate while reducing training overhead. The proposed framework paves the way for LLM-driven intelligent radio environments.
Submitted 18 August, 2025;
originally announced August 2025.
-
Towards SISO Bistatic Sensing for ISAC
Authors:
Zhongqin Wang,
J. Andrew Zhang,
Kai Wu,
Min Xu,
Y. Jay Guo
Abstract:
Integrated Sensing and Communication (ISAC) is a key enabler for next-generation wireless systems. However, real-world deployment is often limited to low-cost, single-antenna transceivers. In such a bistatic Single-Input Single-Output (SISO) setup, clock asynchrony introduces random phase offsets in Channel State Information (CSI), which cannot be mitigated using conventional multi-antenna methods. This work proposes WiDFS 3.0, a lightweight bistatic SISO sensing framework that enables accurate delay and Doppler estimation from distorted CSI by effectively suppressing Doppler mirroring ambiguity. It operates with only a single antenna at both the transmitter and receiver, making it suitable for low-complexity deployments. We propose a self-referencing cross-correlation (SRCC) method for SISO random phase removal and employ delay-domain beamforming to resolve Doppler ambiguity. The resulting unambiguous delay-Doppler-time features enable robust sensing with compact neural networks. Extensive experiments show that WiDFS 3.0 achieves accurate parameter estimation, with performance comparable to or even surpassing that of prior multi-antenna methods, especially in delay estimation. Validated under single- and multi-target scenarios, the extracted ambiguity-resolved features show strong sensing accuracy and generalization. For example, when deployed on the embedded-friendly MobileViT-XXS with only 1.3M parameters, WiDFS 3.0 consistently outperforms conventional features such as CSI amplitude, mirrored Doppler, and multi-receiver aggregated Doppler.
Submitted 18 August, 2025;
originally announced August 2025.
-
Data-driven quantification and visualization of resilience metrics of power distribution system
Authors:
Dingwei Wang,
Salish Maharjan,
Junyuan Zheng,
Liming Liu,
Zhaoyu Wang
Abstract:
This paper presents a data-driven approach for quantifying the resilience of distribution power grids to extreme weather events using two key metrics: (a) the number of outages and (b) restoration time. The method leverages historical outage records maintained by power utilities and weather measurements collected by the National Oceanic and Atmospheric Administration (NOAA) to evaluate resilience across a utility's service territory. The proposed framework consists of three stages. First, outage events are systematically extracted from the outage records by temporally and spatially aggregating coincident component outages. In the second stage, weather zones across the service territory are delineated using a Voronoi polygon approach, based on the locations of NOAA weather sensors. Finally, data-driven models for outage fragility and restoration time are developed for each weather zone. These models enable the quantification and visualization of resilience metrics under varying intensities of extreme weather events. The proposed method is demonstrated using real-world data from a US distribution utility, located in Indianapolis, focused on wind- and precipitation-related events. The dataset spans two decades and includes over 160,000 outage records.
Submitted 17 August, 2025;
originally announced August 2025.
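
The second stage described in the abstract above delineates weather zones as Voronoi polygons around NOAA sensor locations. A point lies in a sensor's Voronoi cell exactly when that sensor is its nearest one, so zone membership can be sketched without constructing the polygons. The snippet below is a minimal illustration under simplifying assumptions: coordinates are treated as planar (x, y) values rather than latitude and longitude, and the function names and sensor ids are hypothetical.

```python
import math
from collections import Counter


def assign_weather_zone(point, sensors):
    """Return the id of the sensor whose Voronoi cell contains `point`.

    point:   (x, y) location of an outage, in planar coordinates
    sensors: dict mapping sensor id -> (x, y) sensor location
    """
    # Nearest-sensor lookup is equivalent to Voronoi cell membership.
    return min(sensors, key=lambda sensor_id: math.dist(point, sensors[sensor_id]))


def outages_per_zone(outage_locations, sensors):
    """Count outage events falling in each sensor's weather zone."""
    return Counter(assign_weather_zone(p, sensors) for p in outage_locations)
```

Once outages are grouped by zone in this way, each zone's counts can be paired with that sensor's weather measurements to fit per-zone fragility and restoration-time models, as the abstract describes.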
-
PUB: A Plasma-Propelled Ultra-Quiet Blimp with Two-DOF Vector Thrusting
Authors:
Zihan Wang
Abstract:
This study presents the design and control of a Plasma-propelled Ultra-silence Blimp (PUB), a novel aerial robot employing plasma vector propulsion for ultra-quiet flight without mechanical propellers. The system utilizes a helium-lift platform for extended endurance and a four-layer ring asymmetric capacitor to generate ionic wind thrust. The modular propulsion units allow flexible configuration to meet mission-specific requirements, while a two-degree-of-freedom (DOF) head enables thrust vector control. A closed-loop slip control scheme is implemented for stable maneuvering. Flight experiments demonstrate full-envelope capability, including take-off, climb, hover, descent, and smooth landing, confirming the feasibility of plasma vector propulsion, the effectiveness of DOF vector control, and the stability of the control system. Owing to its low acoustic signature, structural simplicity, and high maneuverability, PUB is well suited for noise-sensitive, enclosed, and near-space applications.
Submitted 28 August, 2025; v1 submitted 17 August, 2025;
originally announced August 2025.
-
Jamming Identification with Differential Transformer for Low-Altitude Wireless Networks
Authors:
Pengyu Wang,
Zhaocheng Wang,
Tianqi Mao,
Weijie Yuan,
Haijun Zhang,
George K. Karagiannidis
Abstract:
Wireless jamming identification, which detects and classifies electromagnetic jamming from non-cooperative devices, is crucial for emerging low-altitude wireless networks consisting of many drone terminals that are highly susceptible to electromagnetic jamming. However, jamming identification schemes adopting deep learning (DL) are vulnerable to attacks involving carefully crafted adversarial samples, resulting in inevitable robustness degradation. To address this issue, we propose a differential transformer framework for wireless jamming identification. Firstly, we introduce a differential transformer network in order to distinguish jamming signals, which overcomes the attention noise when compared with its traditional counterpart by performing self-attention operations in a differential manner. Secondly, we propose a randomized masking training strategy to improve network robustness, which leverages the patch partitioning mechanism inherent to transformer architectures in order to create parallel feature extraction branches. Each branch operates on a distinct, randomly masked subset of patches, which fundamentally constrains the propagation of adversarial perturbations across the network. Additionally, the ensemble effect generated by fusing predictions from these diverse branches demonstrates superior resilience against adversarial attacks. Finally, we introduce a novel consistent training framework that significantly enhances adversarial robustness through dual-branch regularization. Simulation results demonstrate that our proposed methodology is superior to existing methods in boosting robustness to adversarial samples.
Submitted 17 August, 2025;
originally announced August 2025.
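
The abstract above describes self-attention performed "in a differential manner". One common reading, taken from the general differential-attention idea, is to subtract a second softmax attention map from the first so that attention noise common to both maps cancels. The sketch below shows that reading only. Whether this paper uses exactly this form is an assumption, and the function names and the fixed scalar `lam` are illustrative choices (in practice such a weight is typically learned).

```python
import math


def softmax(row):
    peak = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Apply (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) to the values."""
    map_a = attention_map(q1, k1)
    map_b = attention_map(q2, k2)
    # Subtracting the second map cancels weight that both maps assign alike.
    diff = [[a - lam * b for a, b in zip(ra, rb)] for ra, rb in zip(map_a, map_b)]
    n_cols = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(n_cols)]
        for row in diff
    ]
```

With `lam=0` this reduces to ordinary softmax attention; with identical query/key pairs and `lam=1` the two maps cancel completely and the output is zero.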
-
DermINO: Hybrid Pretraining for a Versatile Dermatology Foundation Model
Authors:
Jingkai Xu,
De Cheng,
Xiangqian Zhao,
Jungang Yang,
Zilong Wang,
Xinyang Jiang,
Xufang Luo,
Lili Chen,
Xiaoli Ning,
Chengxu Li,
Xinzhu Zhou,
Xuejiao Song,
Ang Li,
Qingyue Xia,
Zhou Zhuang,
Hongfei Ouyang,
Ke Xue,
Yujun Sheng,
Rusong Meng,
Feng Xu,
Xi Yang,
Weimin Ma,
Yusheng Lee,
Dongsheng Li,
Xinbo Gao
, et al. (5 additional authors not shown)
Abstract:
Skin diseases impose a substantial burden on global healthcare systems, driven by their high prevalence (affecting up to 70% of the population), complex diagnostic processes, and a critical shortage of dermatologists in resource-limited areas. While artificial intelligence (AI) tools have demonstrated promise in dermatological image analysis, current models face limitations: they often rely on large, manually labeled datasets and are built for narrow, specific tasks, making them less effective in real-world settings. To tackle these limitations, we present DermINO, a versatile foundation model for dermatology. Trained on a curated dataset of 432,776 images from three sources (public repositories, web-sourced images, and proprietary collections), DermINO incorporates a novel hybrid pretraining framework that augments the self-supervised learning paradigm through semi-supervised learning and knowledge-guided prototype initialization. This integrated method not only deepens the understanding of complex dermatological conditions, but also substantially enhances the generalization capability across various clinical tasks. Evaluated across 20 datasets, DermINO consistently outperforms state-of-the-art models across a wide range of tasks. It excels in high-level clinical applications including malignancy classification, disease severity grading, multi-category diagnosis, and dermatological image captioning, while also achieving state-of-the-art performance in low-level tasks such as skin lesion segmentation. Furthermore, DermINO demonstrates strong robustness in privacy-preserving federated learning scenarios and across diverse skin types and sexes. In a blinded reader study with 23 dermatologists, DermINO achieved 95.79% diagnostic accuracy (versus clinicians' 73.66%), and AI assistance improved clinician performance by 17.21%.
Submitted 24 September, 2025; v1 submitted 16 August, 2025;
originally announced August 2025.
-
EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models
Authors:
Wenhui Zhu,
Xiwen Chen,
Zhipeng Wang,
Shao Tang,
Sayan Ghosh,
Xuanzhao Dong,
Rajat Koner,
Yalin Wang
Abstract:
Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference cost remains a major bottleneck, particularly in video. We empirically analyze visual token sampling in MLLMs and observe a strong correlation between subset token coverage and segmentation performance. This motivates our design of a simple and effective token pruning method that selects a compact yet spatially representative subset of tokens to accelerate inference. In this paper, we introduce a novel visual token pruning method for IVS, called EVTP-IVS, which builds upon the k-center by integrating spatial information to ensure better coverage. We further provide an information-theoretic analysis to support our design. Experiments on standard IVS benchmarks show that our method achieves up to 5X speed-up on video tasks and 3.5X on image tasks, while maintaining comparable accuracy using only 20% of the tokens. Our method also consistently outperforms state-of-the-art pruning baselines under varying pruning ratios.
Submitted 15 August, 2025;
originally announced August 2025.
-
UWB-PostureGuard: A Privacy-Preserving RF Sensing System for Continuous Ergonomic Sitting Posture Monitoring
Authors:
Haotang Li,
Zhenyu Qi,
Sen He,
Kebin Peng,
Sheng Tan,
Yili Ren,
Tomas Cerny,
Jiyue Zhao,
Zi Wang
Abstract:
Improper sitting posture during prolonged computer use has become a significant public health concern. Traditional posture monitoring solutions face substantial barriers, including privacy concerns with camera-based systems and user discomfort with wearable sensors. This paper presents UWB-PostureGuard, a privacy-preserving ultra-wideband (UWB) sensing system that advances mobile technologies for preventive health management through continuous, contactless monitoring of ergonomic sitting posture. Our system leverages commercial UWB devices, utilizing comprehensive feature engineering to extract multiple ergonomic sitting posture features. We develop PoseGBDT to effectively capture temporal dependencies in posture patterns, addressing limitations of traditional frame-wise classification approaches. Extensive real-world evaluation across 10 participants and 19 distinct postures demonstrates exceptional performance, achieving 99.11% accuracy while maintaining robustness against environmental variables such as clothing thickness, additional devices, and furniture configurations. Our system provides a scalable, privacy-preserving mobile health solution on existing platforms for proactive ergonomic management, improving quality of life at low cost.
Submitted 14 August, 2025;
originally announced August 2025.
-
Advances in Speech Separation: Techniques, Challenges, and Future Trends
Authors:
Kai Li,
Guo Chen,
Wendi Sang,
Yi Luo,
Zhuo Chen,
Shuai Wang,
Shulin He,
Zhong-Qiu Wang,
Andong Li,
Zhiyong Wu,
Xiaolin Hu
Abstract:
The field of speech separation, addressing the "cocktail party problem", has seen revolutionary advances with DNNs. Speech separation enhances clarity in complex acoustic environments and serves as crucial pre-processing for speech recognition and speaker recognition. However, current literature focuses narrowly on specific architectures or isolated approaches, creating a fragmented understanding. This survey addresses this gap by providing a systematic examination of DNN-based speech separation techniques. Our work differentiates itself through: (I) Comprehensive perspective: We systematically investigate learning paradigms, separation scenarios with known/unknown speakers, comparative analysis of supervised/self-supervised/unsupervised frameworks, and architectural components from encoders to estimation strategies. (II) Timeliness: Coverage of cutting-edge developments ensures access to current innovations and benchmarks. (III) Unique insights: Beyond summarization, we evaluate technological trajectories, identify emerging patterns, and highlight promising directions including domain-robust frameworks, efficient architectures, multimodal integration, and novel self-supervised paradigms. (IV) Fair evaluation: We provide quantitative evaluations on standard datasets, revealing true capabilities and limitations of different methods. This comprehensive survey serves as an accessible reference for experienced researchers and newcomers navigating speech separation's complex landscape.
Submitted 14 August, 2025;
originally announced August 2025.