-
Learning Segmentation from Radiology Reports
Authors:
Pedro R. A. S. Bassi,
Wenxuan Li,
Jieneng Chen,
Zheren Zhu,
Tianyu Lin,
Sergio Decherchi,
Andrea Cavalli,
Kang Wang,
Yang Yang,
Alan L. Yuille,
Zongwei Zhou
Abstract:
Tumor segmentation in CT scans is key for diagnosis, surgery, and prognosis, yet segmentation masks are scarce because their creation requires time and expertise. Public abdominal CT datasets have from dozens to a couple thousand tumor masks, but hospitals have hundreds of thousands of tumor CTs with radiology reports. Thus, leveraging reports to improve segmentation is key for scaling. In this paper, we propose a report-supervision loss (R-Super) that converts radiology reports into voxel-wise supervision for tumor segmentation AI. We created a dataset with 6,718 CT-Report pairs (from the UCSF Hospital), and merged it with public CT-Mask datasets (from AbdomenAtlas 2.0). We used our R-Super to train with these masks and reports, and strongly improved tumor segmentation in internal and external validation: the F1 score increased by up to 16% with respect to training with masks only. By leveraging readily available radiology reports to supplement scarce segmentation masks, R-Super strongly improves AI performance both when very few training masks are available (e.g., 50) and when many masks are available (e.g., 1.7K).
Project: https://github.com/MrGiovanni/R-Super
Submitted 7 July, 2025;
originally announced July 2025.
-
Enhancing Satellite Quantum Key Distribution with Dual Band Reconfigurable Intelligent Surfaces
Authors:
Muhammad Khalil,
Ke Wang,
Jinho Choi
Abstract:
This paper presents a novel system architecture for hybrid satellite communications, integrating quantum key distribution (QKD) and classical radio frequency (RF) data transmission using a dual-band reconfigurable intelligent surface (RIS). The motivation is to address the growing need for global, secure, and reliable communications by leveraging the security of quantum optical links and the robustness of classical RF channels within a unified framework. By employing a frequency-selective RIS, the system independently optimizes both quantum (850 nm) and classical (S-band) channels in real time, dynamically adapting to environmental fluctuations such as atmospheric turbulence and rain attenuation. The joint optimization of the quantum bit error rate (QBER) and the classical signal-to-noise ratio (SNR) is formulated as a quadratic unconstrained binary optimization (QUBO) problem, enabling efficient adaptive phase control utilizing both quantum and classical computational methods. Comprehensive theoretical modeling and simulations, benchmarked against experimental data from the Micius satellite, demonstrate substantial performance gains. Notably, the RIS-assisted system reduces QBER from approximately 2.5% to 0.7%, increases the secure key rate (SKR) to over 30,000 bits per second, and enhances classical RF SNR by about 3 dB at high elevation angles. These results illustrate the practical potential of hybrid RIS-assisted satellite links to deliver robust, efficient, and secure global communications.
Submitted 3 July, 2025;
originally announced July 2025.
-
Structure and Smoothness Constrained Dual Networks for MR Bias Field Correction
Authors:
Dong Liang,
Xingyu Qiu,
Yuzhen Li,
Wei Wang,
Kuanquan Wang,
Suyu Dong,
Gongning Luo
Abstract:
MR imaging techniques are of great benefit to disease diagnosis. However, due to the limitations of MR devices, significant intensity inhomogeneity often exists in imaging results, which impedes both qualitative and quantitative medical analysis. Recently, several unsupervised deep learning-based models have been proposed for MR image improvement. However, these models merely concentrate on global appearance learning, and neglect constraints from image structures and smoothness of the bias field, leading to distorted corrected results. In this paper, novel structure and smoothness constrained dual networks, named S2DNets, are proposed for self-supervised bias field correction. S2DNets introduce piece-wise structural constraints and smoothness of the bias field for network training to effectively remove non-uniform intensity and retain much more structural detail. Extensive experiments executed on both clinical and simulated MR datasets show that the proposed model outperforms other conventional and deep learning-based models. In addition to comparison on visual metrics, downstream MR image segmentation tasks are also used to evaluate the impact of the proposed model. The source code is available at: https://github.com/LeongDong/S2DNets.
Submitted 1 July, 2025;
originally announced July 2025.
-
PanTS: The Pancreatic Tumor Segmentation Dataset
Authors:
Wenxuan Li,
Xinze Zhou,
Qi Chen,
Tianyu Lin,
Pedro R. A. S. Bassi,
Szymon Plotka,
Jaroslaw B. Cwikla,
Xiaoxi Chen,
Chen Ye,
Zheren Zhu,
Kai Ding,
Heng Li,
Kang Wang,
Yang Yang,
Yucheng Tang,
Daguang Xu,
Alan L. Yuille,
Zongwei Zhou
Abstract:
PanTS is a large-scale, multi-institutional dataset curated to advance research in pancreatic CT analysis. It contains 36,390 CT scans from 145 medical centers, with expert-validated, voxel-wise annotations of over 993,000 anatomical structures, covering pancreatic tumors, pancreas head, body, and tail, and 24 surrounding anatomical structures such as vascular/skeletal structures and abdominal/thoracic organs. Each scan includes metadata such as patient age, sex, diagnosis, contrast phase, in-plane spacing, slice thickness, etc. AI models trained on PanTS achieve significantly better performance in pancreatic tumor detection, localization, and segmentation compared to those trained on existing public datasets. Our analysis indicates that these gains are directly attributable to the 16x larger-scale tumor annotations and indirectly supported by the 24 additional surrounding anatomical structures. As the largest and most comprehensive resource of its kind, PanTS offers a new benchmark for developing and evaluating AI models in pancreatic CT analysis.
Submitted 1 July, 2025;
originally announced July 2025.
-
Integrated Multimodal Sensing and Communication: Challenges, Technologies, and Architectures
Authors:
Yubo Peng,
Luping Xiang,
Kun Yang,
Feibo Jiang,
Kezhi Wang,
Christos Masouros
Abstract:
The evolution towards 6G networks requires the intelligent integration of communication and sensing capabilities to support diverse and complex applications, such as autonomous driving and immersive services. However, existing integrated sensing and communication (ISAC) systems predominantly rely on single-modal sensors as primary participants, which leads to a limited representation of environmental features and significant performance bottlenecks under the emerging requirements of 6G applications. This limitation motivates a paradigm shift from single-modal to multimodal ISAC. In this article, we first analyze the key challenges in realizing multimodal ISAC, including the fusion of heterogeneous multimodal data, the high communication overhead among distributed sensors, and the design of efficient and scalable system architectures. We then introduce several enabling technologies, such as large AI models, semantic communication, and multi-agent systems, that hold promise for addressing these challenges. To operationalize these technologies, we zoom into three architectural paradigms: fusion-based multimodal ISAC (F-MAC), interaction-based multimodal ISAC (I-MAC), and relay-based multimodal ISAC (R-MAC), each tailored to organize devices and modalities for efficient collaboration in different scenarios. Thereafter, a case study is presented based on the F-MAC scheme, demonstrating that the scheme achieves more comprehensive sensing and improves sensing accuracy by approximately 80% compared to conventional single-modal ISAC systems. Finally, we discuss several open issues to be addressed in the future.
Submitted 25 June, 2025;
originally announced June 2025.
-
VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge
Authors:
Zijing Zhao,
Kai Wang,
Hao Huang,
Ying Hu,
Liang He,
Jichen Yang
Abstract:
To explore the potential advantages of utilizing spatial cues from images for generating stereo singing voices with room reverberation, we introduce VS-Singer, a vision-guided model designed to produce stereo singing voices with room reverberation from scene images. VS-Singer comprises three modules: firstly, a modal interaction network integrates spatial features into text encoding to create a linguistic representation enriched with spatial information. Secondly, the decoder employs a consistency Schrödinger bridge to facilitate one-step sample generation. Moreover, we utilize the SFE module to improve the consistency of audio-visual matching. To our knowledge, this study is the first to combine stereo singing voice synthesis with visual acoustic matching within a unified framework. Experimental results demonstrate that VS-Singer can effectively generate stereo singing voices that align with the scene perspective in a single step.
Submitted 19 June, 2025;
originally announced June 2025.
-
Reliable Evaluation of MRI Motion Correction: Dataset and Insights
Authors:
Kun Wang,
Tobit Klug,
Stefan Ruschke,
Jan S. Kirschke,
Reinhard Heckel
Abstract:
Correcting motion artifacts in MRI is important, as they can hinder accurate diagnosis. However, evaluating deep learning-based and classical motion correction methods remains fundamentally difficult due to the lack of accessible ground-truth target data. To address this challenge, we study three evaluation approaches: real-world evaluation based on reference scans, simulated motion, and reference-free evaluation, each with its merits and shortcomings. To enable evaluation with real-world motion artifacts, we release PMoC3D, a dataset consisting of unprocessed Paired Motion-Corrupted 3D brain MRI data. To advance evaluation quality, we introduce MoMRISim, a feature-space metric trained for evaluating motion reconstructions. We assess each evaluation approach and find real-world evaluation together with MoMRISim, while not perfect, to be most reliable. Evaluation based on simulated motion systematically exaggerates algorithm performance, and reference-free evaluation overrates oversmoothed deep learning outputs.
Submitted 6 June, 2025;
originally announced June 2025.
-
Joint User Association and Beamforming Design for ISAC Networks with Large Language Models
Authors:
Haoyun Li,
Ming Xiao,
Kezhi Wang,
Robert Schober,
Dong In Kim,
Yong Liang Guan
Abstract:
Integrated sensing and communication (ISAC) has been envisioned to play a more important role in future wireless networks. However, the design of ISAC networks is challenging, especially when there are multiple communication and sensing (C&S) nodes and multiple sensing targets. We investigate a multi-base station (BS) ISAC network in which multiple BSs equipped with multiple antennas simultaneously provide C&S services for multiple ground communication users (CUs) and targets. To enhance the overall performance of C&S, we formulate a joint user association (UA) and multi-BS transmit beamforming optimization problem with the objective of maximizing the total sum rate of all CUs while ensuring both the minimum target detection and parameter estimation requirements. To efficiently solve the highly non-convex mixed integer nonlinear programming (MINLP) optimization problem, we propose an alternating optimization (AO)-based algorithm that decomposes the problem into two sub-problems, i.e., UA optimization and multi-BS transmit beamforming optimization. Inspired by large language models (LLMs) for prediction and inference, we propose a unified framework integrating LLMs with convex-based optimization methods. First, we propose a comprehensive design of prompt engineering, including few-shot, chain of thought, and self-reflection techniques to guide LLMs in solving the binary integer programming UA optimization problem. Second, we utilize convex-based optimization methods to handle the non-convex beamforming optimization problem based on fractional programming (FP), majorization minimization (MM), and the alternating direction method of multipliers (ADMM) with an optimized UA from LLMs. Numerical results demonstrate that our proposed LLM-enabled AO-based algorithm achieves fast convergence and near upper-bound performance with the GPT-o1 model, outperforming various benchmark schemes.
Submitted 5 June, 2025;
originally announced June 2025.
-
Federated Learning Assisted Edge Caching Scheme Based on Lightweight Architecture DDPM
Authors:
Xun Li,
Qiong Wu,
Pingyi Fan,
Kezhi Wang,
Nan Cheng,
Khaled B. Letaief
Abstract:
Edge caching is an emerging technology that empowers caching units at edge nodes, allowing users to fetch contents of interest that have been pre-cached at the edge nodes. The key to pre-caching is to maximize the cache hit percentage for cached content without compromising users' privacy. In this letter, we propose a federated learning (FL) assisted edge caching scheme based on lightweight architecture denoising diffusion probabilistic model (LDPM). Our simulation results verify that our proposed scheme achieves a higher cache hit percentage compared to existing FL-based methods and baseline methods.
Submitted 13 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
-
ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization
Authors:
Pengyu Ren,
Wenhao Guan,
Kaidi Wang,
Peijie Chen,
Qingyang Hong,
Lin Li
Abstract:
In recent years, diffusion-based generative models have demonstrated remarkable performance in speech conversion, including Denoising Diffusion Probabilistic Models (DDPM) and others. However, the advantages of these models come at the cost of requiring a large number of sampling steps. This limitation hinders their practical application in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity speech conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution to the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features by utilizing both content and pitch information, allowing speaker features to reflect the properties of the current speech more accurately. Experimental results show that ReFlow-VC performs exceptionally well in small datasets and zero-shot scenarios.
Submitted 1 June, 2025;
originally announced June 2025.
-
A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement
Authors:
Shenghui Lu,
Hukai Huang,
Jinanglong Yao,
Kaidi Wang,
Qingyang Hong,
Lin Li
Abstract:
This paper proposes a model that integrates sub-band processing and deep filtering to fully exploit information from the target time-frequency (TF) bin and its surrounding TF bins for single-channel speech enhancement. The sub-band module captures surrounding frequency bin information at the input, while the deep filtering module applies filtering at the output to both the target TF bin and its surrounding TF bins. To further improve the model performance, we decouple deep filtering into temporal and frequency components and introduce a two-stage framework, reducing the complexity of filter coefficient prediction at each stage. Additionally, we propose the TAConv module to strengthen convolutional feature extraction. Experimental results demonstrate that the proposed hierarchical deep filtering network (HDF-Net) effectively utilizes surrounding TF bin information and outperforms other advanced systems while using fewer resources.
Submitted 1 June, 2025;
originally announced June 2025.
-
DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec
Authors:
Peijie Chen,
Wenhao Guan,
Kaidi Wang,
Weijie Wu,
Hukai Huang,
Qingyang Hong,
Lin Li
Abstract:
Neural speech codecs are essential for advancing text-to-speech (TTS) systems. With the recent success of large language models in text generation, developing high-quality speech tokenizers has become increasingly important. This paper introduces DS-Codec, a novel neural speech codec featuring a dual-stage training framework with mirror and non-mirror architectures switching, designed to achieve superior speech reconstruction. We conduct extensive experiments and ablation studies to evaluate the effectiveness of our training strategy and compare the performance of the two architectures. Our results show that the mirrored structure significantly enhances the robustness of the learned codebooks, and the training strategy balances the advantages between mirrored and non-mirrored structures, leading to improved high-fidelity speech reconstruction.
Submitted 30 May, 2025;
originally announced May 2025.
-
Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion
Authors:
Kaidi Wang,
Wenhao Guan,
Ziyue Jiang,
Hukai Huang,
Peijie Chen,
Weijie Wu,
Qingyang Hong,
Lin Li
Abstract:
Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive speaking style of the target speaker, thereby limiting the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning with a flow matching transformer. To enable precise control over the prosody of generated speech, we introduce a mask generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and its remarkable accuracy in prosody control for synthesized speech.
Submitted 30 May, 2025;
originally announced May 2025.
-
Latent Representations for Control Design with Provable Stability and Safety Guarantees
Authors:
Paul Lutkus,
Kaiyuan Wang,
Lars Lindemann,
Stephen Tu
Abstract:
We initiate a formal study on the use of low-dimensional latent representations of dynamical systems for verifiable control synthesis. Our main goal is to enable the application of verification techniques -- such as Lyapunov or barrier functions -- that might otherwise be computationally prohibitive when applied directly to the full state representation. Towards this goal, we first provide dynamics-aware approximate conjugacy conditions which formalize the notion of reconstruction error necessary for systems analysis. We then utilize our conjugacy conditions to transfer the stability and invariance guarantees of a latent certificate function (e.g., a Lyapunov or barrier function) for a latent space controller back to the original system. Our analysis also has several important implications for learning latent spaces and dynamics: it highlights the geometric properties that the latent space must preserve, and it provides concrete loss functions for dynamics reconstruction that are directly related to control design. We conclude by demonstrating the applicability of our theory to two case studies: (1) stabilization of a cartpole system, and (2) collision avoidance for a two-vehicle system.
Submitted 29 May, 2025;
originally announced May 2025.
-
From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications
Authors:
Feibo Jiang,
Cunhua Pan,
Li Dong,
Kezhi Wang,
Octavia A. Dobre,
Merouane Debbah
Abstract:
With the advent of 6G communications, intelligent communication systems face multiple challenges, including constrained perception and response capabilities, limited scalability, and low adaptability in dynamic environments. This tutorial provides a systematic introduction to the principles, design, and applications of Large Artificial Intelligence Models (LAMs) and Agentic AI technologies in intelligent communication systems, aiming to offer researchers a comprehensive overview of cutting-edge technologies and practical guidance. First, we outline the background of 6G communications, review the technological evolution from LAMs to Agentic AI, and clarify the tutorial's motivation and main contributions. Subsequently, we present a comprehensive review of the key components required for constructing LAMs. We further categorize LAMs and analyze their applicability, covering Large Language Models (LLMs), Large Vision Models (LVMs), Large Multimodal Models (LMMs), Large Reasoning Models (LRMs), and lightweight LAMs. Next, we propose a LAM-centric design paradigm tailored for communications, encompassing dataset construction and both internal and external learning approaches. Building upon this, we develop an LAM-based Agentic AI system for intelligent communications, clarifying its core components such as planners, knowledge bases, tools, and memory modules, as well as its interaction mechanisms. We also introduce a multi-agent framework with data retrieval, collaborative planning, and reflective evaluation for 6G. Subsequently, we provide a detailed overview of the applications of LAMs and Agentic AI in communication scenarios. Finally, we summarize the research challenges and future directions in current studies, aiming to support the development of efficient, secure, and sustainable next-generation intelligent communication systems.
Submitted 28 May, 2025;
originally announced May 2025.
-
Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASR
Authors:
Longhao Li,
Yangze Li,
Hongfei Xue,
Jie Liu,
Shuai Fang,
Kai Wang,
Lei Xie
Abstract:
CTC-based streaming ASR has gained significant attention in real-world applications but faces two main challenges: accuracy degradation in small chunks and token emission latency. To mitigate these challenges, we propose Delayed-KD, which applies delayed knowledge distillation on CTC posterior probabilities from a non-streaming to a streaming model. Specifically, with a tiny chunk size, we introduce a Temporal Alignment Buffer (TAB) that defines a relative delay range compared to the non-streaming teacher model to align CTC outputs and mitigate non-blank token mismatches. Additionally, TAB enables fine-grained control over token emission delay. Experiments on 178-hour AISHELL-1 and 10,000-hour WenetSpeech Mandarin datasets show consistent superiority of Delayed-KD. Impressively, Delayed-KD at 40 ms latency achieves a lower character error rate (CER) of 5.42% on AISHELL-1, comparable to the competitive U2++ model running at 320 ms latency.
Submitted 28 May, 2025;
originally announced May 2025.
-
Dynamical ON-OFF Control with Trajectory Prediction for Multi-RIS Wireless Networks
Authors:
Kaining Wang,
Bo Yang,
Yusheng Lei,
Zhiwen Yu,
Xuelin Cao,
George C. Alexandropoulos,
Marco Di Renzo,
Chau Yuen
Abstract:
Reconfigurable intelligent surfaces (RISs) have demonstrated an unparalleled ability to reconfigure wireless environments by dynamically controlling the phase, amplitude, and polarization of impinging waves. However, as nearly passive reflective metasurfaces, RISs may not distinguish between desired and interference signals, which can lead to severe spectrum pollution and even affect performance negatively. In particular, in large-scale networks, the signal-to-interference-plus-noise ratio (SINR) at the receiving node can be degraded due to excessive interference reflected from the RIS. To overcome this fundamental limitation, we propose in this paper a trajectory prediction-based dynamical control algorithm (TPC) for anticipating RIS ON-OFF states sequence, integrating a long-short-term-memory (LSTM) scheme to predict user trajectories. In particular, through a codebook-based algorithm, the RIS controller adaptively coordinates the configuration of the RIS elements to maximize the received SINR. Our simulation results demonstrate the superiority of the proposed TPC method over various system settings.
Submitted 27 May, 2025;
originally announced May 2025.
-
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Authors:
Kai Li,
Can Shen,
Yile Liu,
Jirui Han,
Kelong Zheng,
Xuechao Zou,
Zhe Wang,
Xingjian Du,
Shun Zhang,
Hanjun Luo,
Yingbin Jin,
Xinxin Xing,
Ziyang Ma,
Yue Liu,
Xiaojun Jia,
Yifan Zhang,
Junfeng Fang,
Kun Wang,
Yibo Yan,
Haoyang Li,
Yiming Li,
Xiaobin Zhuang,
Yang Liu,
Haibo Hu,
Zhizheng Wu,
et al. (6 additional authors not shown)
Abstract:
The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark defines 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
Submitted 1 July, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
To Stay or to Bypass: Unraveling Mainline Vehicles' Aggregate Strategic Decision-Making at Highway Weaving Ramps
Authors:
Haohui He,
Kexin Wang,
Ruolin Li
Abstract:
The weaving ramp scenario is a critical bottleneck in highway networks due to conflicting flows and complex interactions among merging, exiting, and through vehicles. In this work, we propose a game-theoretic model to capture and predict the aggregate lane choice behavior of mainline through vehicles as they approach the weaving zone. Faced with potential conflicts from merging and exiting vehicles, mainline vehicles can either bypass the conflict zone by changing to an adjacent lane or stay steadfast in their current lane. Our model effectively captures these strategic choices using a small set of parameters, requiring only limited traffic measurements for calibration. The model's validity is demonstrated through SUMO simulations, achieving high predictive accuracy. The simplicity and flexibility of the proposed framework make it a practical tool for analyzing bottleneck weaving scenarios and informing traffic management strategies.
Submitted 13 May, 2025;
originally announced May 2025.
-
Non-contact Vital Signs Detection in Dynamic Environments
Authors:
Shuai Sun,
Chong-Xi Liang,
Chengwei Ye,
Huanzhen Zhang,
Kangsheng Wang
Abstract:
Accurate phase demodulation is critical for vital sign detection using millimeter-wave radar. However, in complex environments, time-varying DC offsets and phase imbalances can severely degrade demodulation performance. To address this, we propose a novel DC offset calibration method alongside a Hilbert and Differential Cross-Multiply (HADCM) demodulation algorithm. The approach estimates time-varying DC offsets from neighboring signal peaks and valleys, then employs both differential forms and Hilbert transforms of the I/Q channel signals to extract vital sign information. Simulation and experimental results demonstrate that the proposed method maintains robust performance under low signal-to-noise ratios. Compared to existing demodulation techniques, it offers more accurate signal recovery in challenging scenarios and effectively suppresses noise interference.
Submitted 13 May, 2025;
originally announced May 2025.
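
For readers unfamiliar with cross-multiply demodulation, the following is a minimal sketch of the classical differentiate-and-cross-multiply step that this family of methods builds on. It is not the HADCM algorithm from the abstract: the Hilbert-transform branch and the peak/valley DC-offset calibration are omitted, and the I/Q inputs are assumed to be already free of DC offset. The function name `dacm_phase` and the test signal parameters are illustrative assumptions.

```python
import numpy as np

def dacm_phase(i_sig, q_sig):
    """Recover phase (relative to the first sample) from I/Q samples.

    Uses the discrete form of d(phi)/dt = (I*dQ - Q*dI) / (I^2 + Q^2),
    then accumulates the per-sample increments. Assumes DC offsets
    have already been removed from both channels.
    """
    i_sig = np.asarray(i_sig, dtype=float)
    q_sig = np.asarray(q_sig, dtype=float)
    di = np.diff(i_sig)
    dq = np.diff(q_sig)
    num = i_sig[1:] * dq - q_sig[1:] * di
    den = i_sig[1:] ** 2 + q_sig[1:] ** 2
    # Prepend zero so the output has the same length as the input.
    return np.concatenate([[0.0], np.cumsum(num / den)])

# Synthetic chest-motion phase: a 1.2 Hz sinusoid sampled at 1 kHz.
fs = 1000.0
t = np.arange(2000) / fs
phi_true = 0.5 * np.sin(2 * np.pi * 1.2 * t)
phi_est = dacm_phase(np.cos(phi_true), np.sin(phi_true))
max_err = float(np.max(np.abs(phi_est - (phi_true - phi_true[0]))))
```

Unlike arctangent demodulation, this form needs no phase unwrapping, which is one reason cross-multiply variants are attractive when the phase excursion is large.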
-
Multi-Agent Reinforcement Learning-based Cooperative Autonomous Driving in Smart Intersections
Authors:
Taoyuan Yu,
Kui Wang,
Zongdian Li,
Tao Yu,
Kei Sakaguchi
Abstract:
Unsignalized intersections pose significant safety and efficiency challenges due to complex traffic flows. This paper proposes a novel roadside unit (RSU)-centric cooperative driving system leveraging global perception and vehicle-to-infrastructure (V2I) communication. The core of the system is an RSU-based decision-making module using a two-stage hybrid reinforcement learning (RL) framework. First, policies are pre-trained offline using conservative Q-learning (CQL) combined with behavior cloning (BC) on a collected dataset. Subsequently, these policies are fine-tuned in simulation using multi-agent proximal policy optimization (MAPPO), combined with a self-attention mechanism to effectively resolve inter-agent dependencies. RSUs perform real-time inference based on the trained models to realize vehicle control via V2I communications. Extensive experiments in the CARLA environment demonstrate the high effectiveness of the proposed system by (i) achieving failure rates below 0.03% in coordinating three connected and autonomous vehicles (CAVs) through complex intersection scenarios, significantly outperforming the traditional Autoware control method, and (ii) exhibiting strong robustness across varying numbers of controlled agents and showing promising generalization capabilities on other maps.
Submitted 7 May, 2025;
originally announced May 2025.
-
A Dataset and Toolkit for Multiparameter Cardiovascular Physiology Sensing on Rings
Authors:
Jiankai Tang,
Kegang Wang,
Yingke Ding,
Jiatong Ji,
Zeyu Wang,
Xiyuxing Zhang,
Ping Chen,
Yuanchun Shi,
Yuntao Wang
Abstract:
Smart rings offer a convenient way to continuously and unobtrusively monitor cardiovascular physiological signals. However, a gap remains between the ring hardware and reliable methods for estimating cardiovascular parameters, partly due to the lack of publicly available datasets and standardized analysis tools. In this work, we present $τ$-Ring, the first open-source ring-based dataset designed for cardiovascular physiological sensing. The dataset comprises photoplethysmography signals (infrared and red channels) and 3-axis accelerometer data collected from two rings (reflective and transmissive optical paths), with 28.21 hours of raw data from 34 subjects across seven activities. $τ$-Ring encompasses both stationary and motion scenarios, as well as stimulus-evoked abnormal physiological states, annotated with four ground-truth labels: heart rate, respiratory rate, oxygen saturation, and blood pressure. Using our proposed RingTool toolkit, we evaluated three widely-used physics-based methods and four cutting-edge deep learning approaches. Our results show superior performance compared to commercial rings, achieving best MAE values of 5.18 BPM for heart rate, 2.98 BPM for respiratory rate, 3.22% for oxygen saturation, and 13.33/7.56 mmHg for systolic/diastolic blood pressure estimation. The open-sourced dataset and toolkit aim to foster further research and community-driven advances in ring-based cardiovascular health sensing.
Submitted 8 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
Antenna Activation and Resource Allocation in Multi-Waveguide Pinching-Antenna Systems
Authors:
Kaidi Wang,
Zhiguo Ding,
George K. Karagiannidis
Abstract:
Pinching antennas, as a novel flexible-antenna technology capable of establishing line of sight (LoS) connections and effectively mitigating large-scale path loss, have recently attracted considerable research interest. However, the implementation of ideal pinching-antenna systems involves determining and adjusting pinching antennas to an arbitrary position on waveguides, which presents challenges to both practical deployment and related optimization. This paper investigates a practical pinching-antenna system in multi-waveguide scenarios, where pinching antennas are installed at pre-configured discrete positions to serve downlink users with non-orthogonal multiple access (NOMA). To improve system throughput, a sophisticated optimization problem is formulated by jointly considering waveguide assignment, antenna activation, successive interference cancellation (SIC) decoding order design, and power allocation. By treating waveguide assignment and antenna activation as two coalition-formation games, a novel game-theoretic algorithm is developed, in which the optimal decoding order is derived and incorporated. For power allocation, monotonic optimization and successive convex approximation (SCA) are employed to construct globally optimal and low-complexity solutions, respectively. Simulation results demonstrate that the NOMA-based pinching-antenna system exhibits superior performance compared to the considered benchmark systems, and the proposed solutions provide significant improvement in terms of sum rate and outage probability.
Submitted 3 May, 2025;
originally announced May 2025.
-
Extended Hybrid Zero Dynamics for Bipedal Walking of the Knee-less Robot SLIDER
Authors:
Rui Zong,
Martin Liang,
Yuntian Fang,
Ke Wang,
Xiaoshuai Chen,
Wei Chen,
Petar Kormushev
Abstract:
Knee-less bipedal robots like SLIDER have the advantage of ultra-lightweight legs and improved walking energy efficiency compared to traditional humanoid robots. In this paper, we firstly introduce an improved hardware design of the SLIDER bipedal robot with new line-feet and more optimized mass distribution that enables higher locomotion speeds. Secondly, we propose an extended Hybrid Zero Dynamics (eHZD) method, which can be applied to prismatic joint robots like SLIDER. The eHZD method is then used to generate a library of gaits with varying reference velocities in an offline way. Thirdly, a Guided Deep Reinforcement Learning (DRL) algorithm is proposed to use the pre-generated library to create walking control policies in real-time. This approach allows us to combine the advantages of both HZD (for generating stable gaits with a full-dynamics model) and DRL (for real-time adaptive gait generation). The experimental results show that this approach achieves 150% higher walking velocity than the previous MPC-based approach.
Submitted 13 June, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
UniPCGC: Towards Practical Point Cloud Geometry Compression via an Efficient Unified Approach
Authors:
Kangli Wang,
Wei Gao
Abstract:
Learning-based point cloud compression methods have made significant progress in terms of performance. However, these methods still encounter challenges including high complexity, limited compression modes, and a lack of support for variable rate, which restrict their practical application. In order to promote the development of practical point cloud compression, we propose an efficient unified point cloud geometry compression framework, dubbed UniPCGC. It is a lightweight framework that supports lossy compression, lossless compression, variable rate and variable complexity. First, we introduce the Uneven 8-Stage Lossless Coder (UELC) in the lossless mode, which allocates more computational complexity to groups with higher coding difficulty, and merges groups with lower coding difficulty. Second, the Variable Rate and Complexity Module (VRCM) is achieved in the lossy mode through joint adoption of a rate modulation module and dynamic sparse convolution. Finally, through the dynamic combination of UELC and VRCM, we achieve lossy compression, lossless compression, variable rate and complexity within a unified framework. Compared to the previous state-of-the-art method, our method achieves a compression ratio (CR) gain of 8.1% on lossless compression, and a Bjontegaard Delta Rate (BD-Rate) gain of 14.02% on lossy compression, while also supporting variable rate and variable complexity.
Submitted 24 March, 2025;
originally announced March 2025.
-
U2AD: Uncertainty-based Unsupervised Anomaly Detection Framework for Detecting T2 Hyperintensity in MRI Spinal Cord
Authors:
Qi Zhang,
Xiuyuan Chen,
Ziyi He,
Kun Wang,
Lianming Wu,
Hongxing Shen,
Jianqi Sun
Abstract:
T2 hyperintensities in spinal cord MR images are crucial biomarkers for conditions such as degenerative cervical myelopathy. However, current clinical diagnoses primarily rely on manual evaluation. Deep learning methods have shown promise in lesion detection, but most supervised approaches are heavily dependent on large, annotated datasets. Unsupervised anomaly detection (UAD) offers a compelling alternative by eliminating the need for abnormal data annotations. However, existing UAD methods rely on curated normal datasets and their performance frequently deteriorates when applied to clinical datasets due to domain shifts. We propose an Uncertainty-based Unsupervised Anomaly Detection framework, termed U2AD, to address these limitations. Unlike traditional methods, U2AD is designed to be trained and tested within the same clinical dataset, following a "mask-and-reconstruction" paradigm built on a Vision Transformer-based architecture. We introduce an uncertainty-guided masking strategy to resolve task conflicts between normal reconstruction and anomaly detection to achieve an optimal balance. Specifically, we employ a Monte-Carlo sampling technique to estimate reconstruction uncertainty mappings during training. By iteratively optimizing reconstruction training under the guidance of both epistemic and aleatoric uncertainty, U2AD reduces overall reconstruction variance while emphasizing regions. Experimental results demonstrate that U2AD outperforms existing supervised and unsupervised methods in patient-level identification and segment-level localization tasks. This framework establishes a new benchmark for incorporating uncertainty guidance into UAD, highlighting its clinical utility in addressing domain shifts and task conflicts in medical image anomaly detection. Our code is available at https://github.com/zhibaishouheilab/U2AD
Submitted 17 March, 2025;
originally announced March 2025.
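
The Monte-Carlo step mentioned in the abstract can be illustrated generically: run a stochastic reconstruction several times and take the per-pixel variance as an uncertainty map. The sketch below is not the U2AD implementation. The `noisy_reconstruct` function is a hypothetical stand-in for a stochastic network (for example, one with random masking or dropout), and its noise levels are arbitrary.

```python
import numpy as np

def mc_uncertainty(reconstruct, image, n_samples=20, seed=0):
    """Return per-pixel mean and variance over repeated stochastic reconstructions."""
    rng = np.random.default_rng(seed)
    samples = np.stack([reconstruct(image, rng) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.var(axis=0)

def noisy_reconstruct(image, rng):
    # Hypothetical stand-in for a stochastic reconstruction model:
    # bright (lesion-like) pixels are reconstructed less consistently.
    noise_scale = np.where(image > 0.5, 0.2, 0.01)
    return image + rng.normal(0.0, noise_scale)

# Toy 8x8 "scan" with a small bright region.
image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0
mean_map, var_map = mc_uncertainty(noisy_reconstruct, image)
lesion_var = float(var_map[2:4, 2:4].mean())
background_var = float(var_map[image == 0].mean())
```

In this toy setup the variance map is largest where reconstructions disagree most, which is the property an uncertainty-guided masking strategy would exploit.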
-
EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera
Authors:
Luming Wang,
Hao Shi,
Xiaoting Yin,
Kailun Yang,
Kaiwei Wang,
Jian Bai
Abstract:
Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing asynchronous event streams due to their synchronous frame-based nature. Moreover, from an egocentric perspective, event cameras record data that includes events generated by both head movements and hand gestures, thereby increasing the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model as context block that decouples head movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BSTM) that shifts features along bins and temporal dimensions to fuse sparse events efficiently. We further establish the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy tested on unseen subjects with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and unseen test patterns differing from training data. Moreover, our approach achieved a remarkable accuracy of 97.0% on the DVS128 Gesture, demonstrating the effectiveness and generalization capability of our method on public datasets. The dataset and models are made available at https://github.com/3190105222/EgoEv_Gesture.
Submitted 13 April, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
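
A parameter-free shift of features along the temporal axis, as described in component (3) of the abstract, can be sketched in the style of the well-known Temporal Shift Module. This is an illustration of the general idea only: the paper's module also shifts along the event-bin dimension, which is omitted here, and the `fold_div` parameter and tensor layout `(T, C, H, W)` are assumptions.

```python
import numpy as np

def temporal_shift(x, fold_div=4):
    """Shift a fraction of channels one step along time; no learnable parameters.

    x has shape (T, C, H, W). The first C // fold_div channels receive
    features from the previous time step, the next C // fold_div channels
    receive features from the following time step, and the rest are unchanged.
    Positions with no neighbor are zero-filled.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                    # from t-1
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]    # from t+1
    out[:, 2 * fold:] = x[:, 2 * fold:]               # untouched channels
    return out

# Tiny example: 3 time steps, 4 channels, 1x1 spatial map.
x = np.arange(12, dtype=float).reshape(3, 4, 1, 1)
shifted = temporal_shift(x, fold_div=4)
```

Because the operation is pure indexing, it mixes information across time steps without adding parameters or multiply-accumulate operations.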
-
Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment
Authors:
Ke Wang,
Lei He,
Kun Liu,
Yan Deng,
Wenning Wei,
Sheng Zhao
Abstract:
Large Multimodal Models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model, specifically GPT-4o. Our study investigates its ability to process speech and audio for pronunciation assessment across multiple levels of granularity and dimensions, with an emphasis on feedback generation and scoring. For our experiments, we use the publicly available Speechocean762 dataset. The evaluation focuses on two key aspects: multi-level scoring and the practicality of the generated feedback. Scoring results are compared against the manual scores provided in the Speechocean762 dataset, while feedback quality is assessed using Large Language Models (LLMs). The findings highlight the effectiveness of integrating LMMs with traditional methods for pronunciation assessment, offering insights into the model's strengths and identifying areas for further improvement.
Submitted 14 March, 2025;
originally announced March 2025.
-
Context-aware Constrained Reinforcement Learning Based Energy-Efficient Power Scheduling for Non-stationary XR Data Traffic
Authors:
Kexuan Wang,
An Liu
Abstract:
In XR downlink transmission, energy-efficient power scheduling (EEPS) is essential for conserving power resources while delivering large data packets within hard-latency constraints. Traditional constrained reinforcement learning (CRL) algorithms show promise in EEPS but still struggle with non-convex stochastic constraints, non-stationary data traffic, and sparse delayed packet dropout feedback (rewards) in XR. To overcome these challenges, this paper models the EEPS in XR as a dynamic parameter-constrained Markov decision process (DP-CMDP) with a varying transition function linked to the non-stationary data traffic and solves it by a proposed context-aware constrained reinforcement learning (CACRL) algorithm, which consists of a context inference (CI) module and a CRL module. The CI module trains an encoder and multiple potential networks to characterize the current transition function and reshape the packet dropout rewards according to the context, transforming the original DP-CMDP into a general CMDP with immediate dense rewards. The CRL module employs a policy network to make EEPS decisions under this CMDP and optimizes the policy using a constrained stochastic successive convex approximation (CSSCA) method, which is better suited for non-convex stochastic constraints. Finally, theoretical analyses provide deep insights into the CACRL algorithm, while extensive simulations demonstrate that it outperforms advanced baselines in both power conservation and satisfying packet dropout constraints.
Submitted 12 March, 2025;
originally announced March 2025.
-
SIMAC: A Semantic-Driven Integrated Multimodal Sensing And Communication Framework
Authors:
Yubo Peng,
Luping Xiang,
Kun Yang,
Feibo Jiang,
Kezhi Wang,
Dapeng Oliver Wu
Abstract:
Traditional single-modality sensing faces limitations in accuracy and capability, and its decoupled implementation with communication systems increases latency in bandwidth-constrained environments. Additionally, single-task-oriented sensing systems fail to address users' diverse demands. To overcome these challenges, we propose a semantic-driven integrated multimodal sensing and communication (SIMAC) framework. This framework leverages a joint source-channel coding architecture to achieve simultaneous sensing decoding and transmission of sensing results. Specifically, SIMAC first introduces a multimodal semantic fusion (MSF) network, which employs two extractors to extract semantic information from radar signals and images, respectively. MSF then applies cross-attention mechanisms to fuse these unimodal features and generate multimodal semantic representations. Secondly, we present a large language model (LLM)-based semantic encoder (LSE), where relevant communication parameters and multimodal semantics are mapped into a unified latent space and input to the LLM, enabling channel-adaptive semantic encoding. Thirdly, a task-oriented sensing semantic decoder (SSD) is proposed, in which different decoded heads are designed according to the specific needs of tasks. Simultaneously, a multi-task learning strategy is introduced to train the SIMAC framework, achieving diverse sensing services. Finally, experimental simulations demonstrate that the proposed framework achieves diverse sensing services and higher accuracy.
Submitted 10 March, 2025;
originally announced March 2025.
-
Semantic Communications with Computer Vision Sensing for Edge Video Transmission
Authors:
Yubo Peng,
Luping Xiang,
Kun Yang,
Kezhi Wang,
Merouane Debbah
Abstract:
Despite the widespread adoption of vision sensors in edge applications, such as surveillance, the transmission of video data consumes substantial spectrum resources. Semantic communication (SC) offers a solution by extracting and compressing information at the semantic level, preserving the accuracy and relevance of transmitted data while significantly reducing the volume of transmitted information. However, traditional SC methods face inefficiencies due to the repeated transmission of static frames in edge videos, exacerbated by the absence of sensing capabilities, which results in spectrum inefficiency. To address this challenge, we propose an SC with computer vision sensing (SCCVS) framework for edge video transmission. The framework first introduces a compression ratio (CR) adaptive SC (CRSC) model, capable of adjusting CR based on whether the frames are static or dynamic, effectively conserving spectrum resources. Additionally, we implement an object detection and semantic segmentation models-enabled sensing (OSMS) scheme, which intelligently senses the changes in the scene and assesses the significance of each frame through in-context analysis. Hence, the OSMS scheme provides CR prompts to the CRSC model based on real-time sensing results. Moreover, both CRSC and OSMS are designed as lightweight models, ensuring compatibility with resource-constrained sensors commonly used in practical edge applications. Experimental simulations validate the effectiveness of the proposed SCCVS framework, demonstrating its ability to enhance transmission efficiency without sacrificing critical semantic information.
Submitted 10 March, 2025;
originally announced March 2025.
-
Pathology-Guided AI System for Accurate Segmentation and Diagnosis of Cervical Spondylosis
Authors:
Qi Zhang,
Xiuyuan Chen,
Ziyi He,
Lianming Wu,
Kun Wang,
Jianqi Sun,
Hongxing Shen
Abstract:
Cervical spondylosis, a complex and prevalent condition, demands precise and efficient diagnostic techniques for accurate assessment. While MRI offers detailed visualization of cervical spine anatomy, manual interpretation remains labor-intensive and prone to error. To address this, we developed an innovative AI-assisted Expert-based Diagnosis System that automates both segmentation and diagnosis of cervical spondylosis using MRI. Leveraging a dataset of 960 cervical MRI images from patients with cervical disc herniation, our system features a pathology-guided segmentation model capable of accurately segmenting key cervical anatomical structures. The segmentation is followed by an expert-based diagnostic framework that automates the calculation of critical clinical indicators. Our segmentation model achieved an average Dice coefficient exceeding 0.90 across four cervical spinal anatomies and demonstrated enhanced accuracy in herniation areas. Diagnostic evaluation further showcased the system's precision, with a mean absolute error (MAE) of 2.44 degrees for the C2-C7 Cobb angle and 3.60 percent for the Maximum Spinal Cord Compression (MSCC) coefficient. In addition, our method delivered high accuracy, precision, recall, and F1 scores in herniation localization, K-line status assessment, and T2 hyperintensity detection. Comparative analysis demonstrates that our system outperforms existing methods, establishing a new benchmark for segmentation and diagnostic tasks for cervical spondylosis.
Submitted 8 March, 2025;
originally announced March 2025.
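
The two metrics quoted in the abstract, the Dice coefficient for segmentation overlap and the mean absolute error (MAE) for scalar indicators, have standard definitions that can be sketched as follows. The function names and the toy inputs are illustrative assumptions and are not taken from the paper's code or data.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def mean_absolute_error(pred, target):
    """Average absolute difference between predicted and reference values."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))

# Toy masks: prediction covers two pixels, reference covers one of them.
dice = dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]])
# Toy angle estimates in degrees (made-up values for illustration).
mae = mean_absolute_error([10.0, 20.0], [12.0, 17.0])
```

Here `dice` evaluates to 2·1/(2+1) ≈ 0.667 and `mae` to (2+3)/2 = 2.5, which shows how a Dice score above 0.90 implies substantial mask overlap.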
-
GrInAdapt: Scaling Retinal Vessel Structural Map Segmentation Through Grounding, Integrating and Adapting Multi-device, Multi-site, and Multi-modal Fundus Domains
Authors:
Zixuan Liu,
Aaron Honjaya,
Yuekai Xu,
Yi Zhang,
Hefu Pan,
Xin Wang,
Linda G Shapiro,
Sheng Wang,
Ruikang K Wang
Abstract:
Retinal vessel segmentation is critical for diagnosing ocular conditions, yet current deep learning methods are limited by modality-specific challenges and significant distribution shifts across imaging devices, resolutions, and anatomical regions. In this paper, we propose GrInAdapt, a novel framework for source-free multi-target domain adaptation that leverages multi-view images to refine segmentation labels and enhance model generalizability for optical coherence tomography angiography (OCTA) of the fundus of the eye. GrInAdapt follows an intuitive three-step approach: (i) grounding images to a common anchor space via registration, (ii) integrating predictions from multiple views to achieve improved label consensus, and (iii) adapting the source model to diverse target domains. Furthermore, GrInAdapt is flexible enough to incorporate auxiliary modalities such as color fundus photography, to provide complementary cues for robust vessel segmentation. Extensive experiments on a multi-device, multi-site, and multi-modal retinal dataset demonstrate that GrInAdapt significantly outperforms existing domain adaptation methods, achieving higher segmentation accuracy and robustness across multiple domains. These results highlight the potential of GrInAdapt to advance automated retinal vessel analysis and support robust clinical decision-making.
Submitted 7 March, 2025;
originally announced March 2025.
-
HealthiVert-GAN: A Novel Framework of Pseudo-Healthy Vertebral Image Synthesis for Interpretable Compression Fracture Grading
Authors:
Qi Zhang,
Shunan Zhang,
Ziqi Zhao,
Kun Wang,
Jun Xu,
Jianqi Sun
Abstract:
Osteoporotic vertebral compression fractures (VCFs) are prevalent in the elderly population, typically assessed on computed tomography (CT) scans by evaluating vertebral height loss. This assessment helps determine the fracture's impact on spinal stability and the need for surgical intervention. However, clinical data indicate that many VCFs exhibit irregular compression, complicating accurate diagnosis. While deep learning methods have shown promise in aiding VCF screening, they often lack interpretability and sufficient sensitivity, limiting their clinical applicability. To address these challenges, we introduce a novel framework that combines vertebra synthesis, height loss quantification, and VCF grading. Our proposed model, HealthiVert-GAN, utilizes a coarse-to-fine synthesis network designed to generate pseudo-healthy vertebral images that simulate the pre-fracture state of fractured vertebrae. This model integrates three auxiliary modules that leverage the morphology and height information of adjacent healthy vertebrae to ensure anatomical consistency. Additionally, we introduce the Relative Height Loss of Vertebrae (RHLV) as a quantification metric, which divides each vertebra into three sections to measure height loss between pre-fracture and post-fracture states, followed by fracture severity classification using a Support Vector Machine (SVM). Our approach achieves state-of-the-art classification performance on both the Verse2019 dataset and our private dataset, and it provides cross-sectional distribution maps of vertebral height loss. This practical tool enhances diagnostic sensitivity in clinical settings and assists in surgical decision-making. Our code is available at https://github.com/zhibaishouheilab/HealthiVert-GAN.
Submitted 7 March, 2025;
originally announced March 2025.
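The RHLV metric described in the abstract above (three sections per vertebra, height loss between pre-fracture and post-fracture states) can be illustrated with a minimal sketch. The abstract does not state the formula, so the definition used here, (pre - post) / pre for each section, is an assumption, as are the function name `relative_height_loss` and the sample heights. This is not the authors' implementation, and the SVM grading step is not shown.

```python
def relative_height_loss(pre_heights, post_heights):
    """Per-section relative height loss.

    Assumed definition (not taken from the paper): (pre - post) / pre,
    where `pre` is the height of the synthesized pseudo-healthy section
    and `post` is the height of the observed fractured section.
    """
    if len(pre_heights) != len(post_heights):
        raise ValueError("pre and post must have the same number of sections")
    losses = []
    for pre, post in zip(pre_heights, post_heights):
        if pre <= 0:
            raise ValueError("pre-fracture height must be positive")
        losses.append((pre - post) / pre)
    return losses


# Hypothetical heights in millimetres for three sections of one vertebra.
pre = [25.0, 24.0, 25.0]
post = [20.0, 18.0, 25.0]
losses = relative_height_loss(pre, post)
print(losses)
```

A per-section vector like `losses` is the kind of feature a severity classifier could consume; how the paper builds its SVM inputs is not specified in the abstract.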
-
Omnidirectional Multi-Object Tracking
Authors:
Kai Luo,
Hao Shi,
Sheng Wu,
Fei Teng,
Mengfei Duan,
Chang Huang,
Yuhang Wang,
Kaiwei Wang,
Kailun Yang
Abstract:
Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in panoramic field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset--a comprehensive panoramic dataset collected by a quadruped robot, featuring diverse challenges such as panoramic fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, representing an improvement of 3.43%, and further achieves 23.45% on QuadTrack, surpassing the baseline by 6.81%. The established dataset and source code are available at https://github.com/xifen523/OmniTrack.
Submitted 23 March, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
Brain Foundation Models: A Survey on Advancements in Neural Signal Processing and Brain Discovery
Authors:
Xinliang Zhou,
Chenyu Liu,
Zhisheng Chen,
Kun Wang,
Yi Ding,
Ziyu Jia,
Qingsong Wen
Abstract:
Brain foundation models (BFMs) have emerged as a transformative paradigm in computational neuroscience, offering a revolutionary framework for processing diverse neural signals across different brain-related tasks. These models leverage large-scale pre-training techniques, allowing them to generalize effectively across multiple scenarios, tasks, and modalities, thus overcoming the traditional limitations faced by conventional artificial intelligence (AI) approaches in understanding complex brain data. By tapping into the power of pretrained models, BFMs provide a means to process neural data in a more unified manner, enabling advanced analysis and discovery in the field of neuroscience. In this survey, we define BFMs for the first time, providing a clear and concise framework for constructing and utilizing these models in various applications. We also examine the key principles and methodologies for developing these models, shedding light on how they transform the landscape of neural signal processing. This survey presents a comprehensive review of the latest advancements in BFMs, covering the most recent methodological innovations, novel views of application areas, and challenges in the field. Notably, we highlight the future directions and key challenges that need to be addressed to fully realize the potential of BFMs. These challenges include improving the quality of brain data, optimizing model architecture for better generalization, increasing training efficiency, and enhancing the interpretability and robustness of BFMs in real-world applications.
Submitted 1 March, 2025;
originally announced March 2025.
-
Towards a Molecular Computer: Enabling Arithmetic Operations in Molecular Communication
Authors:
Jianqiao Long,
Lei Zhang,
Miaowen Wen,
Kezhi Wang,
Natalio Krasnogor,
Jichun Li
Abstract:
In current molecular communication (MC) systems, performing computational operations at the nanoscale remains challenging, restricting their applicability in complex scenarios such as adaptive biochemical control and advanced nanoscale sensing. To overcome this challenge, this paper proposes a novel framework that seamlessly integrates computation into the molecular communication process. The system enables arithmetic operations, namely addition, subtraction, multiplication, and division, by encoding numerical values into two types of molecules emitted by each transmitter to represent positive and negative values, respectively. Specifically, addition is achieved by transmitting non-reactive molecules, while subtraction employs reactive molecules that interact during propagation. The receiver demodulates molecular counts to directly compute the desired results. A theoretical analysis of an upper bound on the bit error rate (BER) and computational simulations confirm the system's robustness in performing complex arithmetic tasks. Compared to conventional MC methods, the proposed approach not only enables fundamental computational operations at the nanoscale but also lays the groundwork for intelligent, autonomous molecular networks.
Submitted 27 February, 2025;
originally announced February 2025.
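The encoding idea in the abstract above (two molecule types for positive and negative values, with the receiver computing directly from molecule counts) can be sketched as a toy model. The model is deliberately idealized and is an assumption throughout: there is no diffusion, noise, or bit error; opposite molecule types annihilate one-to-one whenever both are present; and the function names `encode`, `channel`, `decode`, `add`, and `subtract` are hypothetical. Multiplication and division are not covered.

```python
def encode(value):
    """Map a signed integer to (positive_count, negative_count) molecules."""
    return (value, 0) if value >= 0 else (0, -value)


def channel(emissions):
    """Idealized channel: pool all emissions, then let opposite types react 1:1.

    Assumption: reactions are complete and lossless. The real system is
    subject to propagation effects that this sketch ignores.
    """
    pos = sum(p for p, _ in emissions)
    neg = sum(n for _, n in emissions)
    reacted = min(pos, neg)
    return pos - reacted, neg - reacted


def decode(received):
    """Receiver: net value is the positive count minus the negative count."""
    pos, neg = received
    return pos - neg


def add(a, b):
    # Both transmitters emit according to the sign of their own operand.
    return decode(channel([encode(a), encode(b)]))


def subtract(a, b):
    # Subtraction is addition of the negated operand; the opposite
    # molecule types react during propagation.
    return decode(channel([encode(a), encode(-b)]))


print(add(5, 3), subtract(5, 8))
```

In this sketch the sign of the result is carried by whichever molecule type survives the reaction, which is why the receiver only needs to count.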
-
Pseudo-Measurement Enhancement in Power Distribution Systems
Authors:
Tao Xu,
Kaiqi Wang,
Jiadong Zhang,
Ji Qiao,
Zixuan Zhao,
Hong Zhu,
Kai Sun
Abstract:
With the rapid development of smart distribution networks (DNs), the integrity and accuracy of grid measurement data are crucial to the safety and stability of the entire system. However, the quality of the user power consumption data cannot be guaranteed during the collection and transmission process. To this end, this paper proposes a low-rank tensor completion model based on CANDECOMP/PARAFAC decomposition (CPD-LRTC) to enhance the quality of the measurement data of the DNs. Firstly, the causes and the associated characteristics of the missing data are analyzed, and a third-order standard tensor is constructed as a mathematical model of the measurement data of the DN. Then, a completion model is established based on the characteristics of measurement data and the low rank of the completion tensor, and the alternating direction method of multipliers (ADMM) is used to solve it iteratively. Finally, the proposed model is verified through two case studies, in which the completion accuracy, computational efficiency, and memory usage are compared with those of traditional methods.
Submitted 22 February, 2025;
originally announced February 2025.
-
SpeHeatal: A Cluster-Enhanced Segmentation Method for Sperm Morphology Analysis
Authors:
Yi Shi,
Yunkai Wang,
Xupeng Tian,
Tieyi Zhang,
Bing Yao,
Hui Wang,
Yong Shao,
Cencen Wang,
Rong Zeng
Abstract:
The accurate assessment of sperm morphology is crucial in andrological diagnostics, where the segmentation of sperm images presents significant challenges. Existing approaches frequently rely on large annotated datasets and often struggle with the segmentation of overlapping sperm and the presence of dye impurities. To address these challenges, this paper first analyzes the issue of overlapping sperm tails from a geometric perspective and introduces a novel clustering algorithm, Con2Dis, which effectively segments overlapping tails by considering three essential factors: CONnectivity, CONformity, and DIStance. Building on this foundation, we propose an unsupervised method, SpeHeatal, designed for the comprehensive segmentation of the SPErm HEAd and TAiL. SpeHeatal employs the Segment Anything Model (SAM) to generate masks for sperm heads while filtering out dye impurities, utilizes Con2Dis to segment tails, and then applies a tailored mask splicing technique to produce complete sperm masks. Experimental results underscore the superior performance of SpeHeatal, particularly in handling images with overlapping sperm.
Submitted 18 February, 2025;
originally announced February 2025.
-
Event-aided Semantic Scene Completion
Authors:
Shangwei Guo,
Hao Shi,
Song Wang,
Xiaoting Yin,
Kailun Yang,
Kaiwei Wang
Abstract:
Autonomous driving systems rely on robust 3D scene understanding. Recent advances in Semantic Scene Completion (SSC) for autonomous driving underscore the limitations of RGB-based approaches, which struggle under motion blur, poor lighting, and adverse weather. Event cameras, offering high dynamic range and low latency, address these challenges by providing asynchronous data that complements RGB inputs. We present DSEC-SSC, the first real-world benchmark specifically designed for event-aided SSC, which includes a novel 4D labeling pipeline for generating dense, visibility-aware labels that adapt dynamically to object motion. Our proposed RGB-Event fusion framework, EvSSC, introduces an Event-aided Lifting Module (ELM) that effectively bridges 2D RGB-Event features to 3D space, enhancing view transformation and the robustness of 3D volume construction across SSC models. Extensive experiments on DSEC-SSC and simulated SemanticKITTI-E demonstrate that EvSSC is adaptable to both transformer-based and LSS-based SSC architectures. Notably, evaluations on SemanticKITTI-C demonstrate that EvSSC achieves consistently improved prediction accuracy across five degradation modes and both In-domain and Out-of-domain settings, achieving up to a 52.5% relative improvement in mIoU when the image sensor partially fails. Additionally, we quantitatively and qualitatively validate the superiority of EvSSC under motion blur and extreme weather conditions, where autonomous driving is challenged. The established datasets and our codebase will be made publicly available at https://github.com/Pandapan01/EvSSC.
Submitted 4 February, 2025;
originally announced February 2025.
-
ISAC MIMO Systems with OTFS Waveforms and Virtual Arrays
Authors:
Kailong Wang,
Athina Petropulu
Abstract:
A novel Integrated Sensing-Communication (ISAC) system is proposed that can accommodate high mobility scenarios while making efficient use of bandwidth for both communication and sensing. The system comprises a monostatic multiple-input multiple-output (MIMO) radar that transmits orthogonal time frequency space (OTFS) waveforms. Bandwidth efficiency is achieved by making Doppler-delay (DD) domain bins available for shared use by the transmit antennas. For maximum communication rate, all DD-domain bins are used as shared, but in this case, the target resolution is limited by the aperture of the receive array. A low-complexity method is proposed for obtaining coarse estimates of the radar target parameters in that case. A novel approach is also proposed to construct a virtual array (VA) for achieving a target resolution higher than that allowed by the receive array. The VA is formed by enforcing zeros on certain time-frequency (TF) domain bins, thereby creating private bins assigned to specific transmit antennas. The TF signals received on these private bins are orthogonal, enabling the synthesis of a VA. When combined with coarse target estimates, this approach provides high-accuracy target estimation. To preserve DD-domain information, the introduction of private bins requires reducing the number of DD-domain symbols, resulting in a trade-off between communication rate and sensing performance. However, even a small number of private bins is sufficient to achieve significant sensing gains with minimal communication rate loss. The proposed system is robust to Doppler frequency shifts that arise in high mobility scenarios.
Submitted 3 February, 2025;
originally announced February 2025.
-
Tumor Detection, Segmentation and Classification Challenge on Automated 3D Breast Ultrasound: The TDSC-ABUS Challenge
Authors:
Gongning Luo,
Mingwang Xu,
Hongyu Chen,
Xinjie Liang,
Xing Tao,
Dong Ni,
Hyunsu Jeong,
Chulhong Kim,
Raphael Stock,
Michael Baumgartner,
Yannick Kirchhoff,
Maximilian Rokuss,
Klaus Maier-Hein,
Zhikai Yang,
Tianyu Fan,
Nicolas Boutry,
Dmitry Tereshchenko,
Arthur Moine,
Maximilien Charmetant,
Jan Sauer,
Hao Du,
Xiang-Hui Bai,
Vipul Pai Raikar,
Ricardo Montoya-del-Angel,
Robert Marti
, et al. (12 additional authors not shown)
Abstract:
Breast cancer is one of the most common causes of death among women worldwide. Early detection helps in reducing the number of deaths. Automated 3D Breast Ultrasound (ABUS) is a newer approach for breast screening, which has many advantages over handheld mammography such as safety, speed, and higher detection rate of breast cancer. Tumor detection, segmentation, and classification are key components in the analysis of medical images, especially challenging in the context of 3D ABUS due to the significant variability in tumor size and shape, unclear tumor boundaries, and a low signal-to-noise ratio. The lack of publicly accessible, well-labeled ABUS datasets further hinders the advancement of systems for breast tumor analysis. Addressing this gap, we have organized the inaugural Tumor Detection, Segmentation, and Classification Challenge on Automated 3D Breast Ultrasound 2023 (TDSC-ABUS2023). This initiative aims to spearhead research in this field and create a definitive benchmark for tasks associated with 3D ABUS image analysis. In this paper, we summarize the top-performing algorithms from the challenge and provide critical analysis for ABUS image examination. We offer the TDSC-ABUS challenge as an open-access platform at https://tdsc-abus2023.grand-challenge.org/ to benchmark and inspire future developments in algorithmic research.
Submitted 26 January, 2025;
originally announced January 2025.
-
Vehicular Multi-Tier Distributed Computing with Hybrid THz-RF Transmission in Satellite-Terrestrial Integrated Networks
Authors:
Ni Zhang,
Kunlun Wang,
Wen Chen,
Jing Xu,
Yonghui Li,
Arumugam Nallanathan
Abstract:
In this paper, we propose a Satellite-Terrestrial Integrated Network (STIN) assisted vehicular multi-tier distributed computing (VMDC) system leveraging hybrid terahertz (THz) and radio frequency (RF) communication technologies. Task offloading for satellite edge computing is enabled by THz communication using the orthogonal frequency division multiple access (OFDMA) technique. For terrestrial edge computing, we employ non-orthogonal multiple access (NOMA) and vehicle clustering to realize task offloading. We formulate a non-convex optimization problem aimed at maximizing computation efficiency by jointly optimizing bandwidth allocation, task allocation, subchannel-vehicle matching and power allocation. To address this non-convex optimization problem, we decompose the original problem into four sub-problems and solve them using an alternating iterative optimization approach. For the subproblem of task allocation, we solve it by linear programming. To solve the subproblem of sub-channel allocation, we exploit many-to-one matching theory to obtain the result. The subproblem of bandwidth allocation of OFDMA and the subproblem of power allocation of NOMA are solved by quadratic transformation method. Finally, the simulation results show that our proposed scheme significantly enhances the computation efficiency of the STIN-based VMDC system compared with the benchmark schemes.
Submitted 26 January, 2025;
originally announced January 2025.
-
RadGPT: Constructing 3D Image-Text Tumor Datasets
Authors:
Pedro R. A. S. Bassi,
Mehmet Can Yavuz,
Kang Wang,
Xiaoxi Chen,
Wenxuan Li,
Sergio Decherchi,
Andrea Cavalli,
Yang Yang,
Alan Yuille,
Zongwei Zhou
Abstract:
With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present RadGPT, an Anatomy-Aware Vision-Language AI Agent for generating detailed reports from CT scans. RadGPT first segments tumors, including benign cysts and malignant tumors, and their surrounding anatomical structures, then transforms this information into both structured reports and narrative reports. These reports provide tumor size, shape, location, attenuation, volume, and interactions with surrounding blood vessels and organs. Extensive evaluation on unseen hospitals shows that RadGPT can produce accurate reports, with high sensitivity/specificity for small tumor (<2 cm) detection: 80/73% for liver tumors, 92/78% for kidney tumors, and 77/77% for pancreatic tumors. For large tumors, sensitivity ranges from 89% to 97%. The results significantly surpass the state-of-the-art in abdominal CT report generation.
RadGPT generated reports for 17 public datasets. Through radiologist review and refinement, we have ensured the reports' accuracy, and created the first publicly available image-text 3D medical dataset, comprising over 1.8 million text tokens and 2.7 million images from 9,262 CT scans, including 2,947 tumor scans/reports of 8,562 tumor instances. Our reports can: (1) localize tumors in eight liver sub-segments and three pancreatic sub-segments annotated per-voxel; (2) determine pancreatic tumor stage (T1-T4) in 260 reports; and (3) present individual analyses of multiple tumors--rare in human-made reports. Importantly, 948 of the reports are for early-stage tumors.
Submitted 8 January, 2025;
originally announced January 2025.
-
Learning-Based Stable Optimal Guidance for Spacecraft Close-Proximity Operations
Authors:
Kun Wang,
Roberto Armellin,
Adam Evans,
Harry Holt,
Zheng Chen
Abstract:
Machine learning techniques have demonstrated their effectiveness in achieving autonomy and optimality for nonlinear and high-dimensional dynamical systems. However, traditional black-box machine learning methods often lack formal stability guarantees, which are critical for safety-sensitive aerospace applications. This paper proposes a comprehensive framework that combines control Lyapunov functions with supervised learning to provide certifiably stable, time- and fuel-optimal guidance for rendezvous maneuvers governed by Clohessy-Wiltshire dynamics. The framework is easily extensible to nonlinear control-affine systems. A novel neural candidate Lyapunov function is developed to ensure positive definiteness. Subsequently, a control policy is defined, in which the thrust direction vector minimizes the Lyapunov function's time derivative, and the thrust throttle is determined using the minimal required throttle. This approach ensures that all loss terms related to the control Lyapunov function are either naturally satisfied or replaced by the derived control policy. To jointly supervise the Lyapunov function and the control policy, a simple loss function is introduced, leveraging optimal state-control pairs obtained by a polynomial-map-based method. Consequently, the trained neural network not only certifies the Lyapunov function but also generates a near-optimal guidance policy, even for the bang-bang fuel-optimal problem. Extensive numerical simulations are presented to validate the proposed method.
Submitted 2 January, 2025;
originally announced January 2025.
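The statement in the abstract above that the thrust direction minimizes the Lyapunov function's time derivative can be illustrated with a short sketch. It assumes thrust enters the dynamics only as an acceleration on the velocity states, so the control-dependent part of the derivative is the dot product of the gradient of V with respect to velocity and the thrust vector. Under that assumption, the unit direction that minimizes the derivative is the negative normalized gradient. The function name `thrust_direction` and the sample gradient are hypothetical. The neural Lyapunov function and the throttle rule from the paper are not modelled.

```python
import math


def thrust_direction(grad_v_velocity):
    """Unit vector d minimizing dot(grad_v_velocity, d) subject to |d| = 1.

    `grad_v_velocity` is the gradient of a candidate Lyapunov function with
    respect to the velocity states (an assumed input, supplied by the caller).
    Returns a zero vector when the gradient vanishes, since every direction
    is then equally good.
    """
    norm = math.sqrt(sum(g * g for g in grad_v_velocity))
    if norm == 0.0:
        return [0.0 for _ in grad_v_velocity]
    return [-g / norm for g in grad_v_velocity]


# Hypothetical gradient of V with respect to the three velocity components.
grad = [3.0, 0.0, 4.0]
direction = thrust_direction(grad)
# The control-dependent term of dV/dt per unit thrust acceleration.
decrease_rate = sum(g * d for g, d in zip(grad, direction))
print(direction, decrease_rate)
```

With this choice the control-dependent term equals the negative norm of the gradient, which is the most negative value any unit direction can achieve.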
-
Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion
Authors:
Zhiqiang Yan,
Zhengxue Wang,
Kun Wang,
Jun Li,
Jian Yang
Abstract:
In this paper, we introduce the Selective Image Guided Network (SigNet), a novel degradation-aware framework that transforms depth completion into depth enhancement for the first time. Moving beyond direct completion using convolutional neural networks (CNNs), SigNet initially densifies sparse depth data through non-CNN densification tools to obtain coarse yet dense depth. This approach eliminates the mismatch and ambiguity caused by direct convolution over irregularly sampled sparse data. Subsequently, SigNet redefines completion as enhancement, establishing a self-supervised degradation bridge between the coarse depth and the targeted dense depth for effective RGB-D fusion. To achieve this, SigNet leverages the implicit degradation to adaptively select high-frequency components (e.g., edges) of RGB data to compensate for the coarse depth. This degradation is further integrated into a multi-modal conditional Mamba, dynamically generating the state parameters to enable efficient global high-frequency information interaction. We conduct extensive experiments on the NYUv2, DIML, SUN RGBD, and TOFDC datasets, demonstrating the state-of-the-art (SOTA) performance of SigNet.
Submitted 7 March, 2025; v1 submitted 26 December, 2024;
originally announced December 2024.
-
Text-Driven Tumor Synthesis
Authors:
Xinran Li,
Yi Shuai,
Chen Liu,
Qi Chen,
Qilong Wu,
Pengfei Guo,
Dong Yang,
Can Zhao,
Pedro R. A. S. Bassi,
Daguang Xu,
Kang Wang,
Yang Yang,
Alan Yuille,
Zongwei Zhou
Abstract:
Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance by training on these challenging cases. However, existing synthesis methods, which are typically unconditional -- generating images from random variables -- or conditioned only by tumor shapes, lack controllability over specific tumor characteristics such as texture, heterogeneity, boundaries, and pathology type. As a result, the generated tumors may be overly similar or duplicates of existing training data, failing to effectively address AI's weaknesses. We propose a new text-driven tumor synthesis approach, termed TextoMorph, that provides textual control over tumor characteristics. This is particularly beneficial for examples that confuse the AI the most, such as early tumor detection (increasing Sensitivity by +8.5%), tumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and classification between benign and malignant tumors (improving Sensitivity by +8.2%). By incorporating text mined from radiology reports into the synthesis process, we increase the variability and controllability of the synthetic tumors to target AI's failure cases more precisely. Moreover, TextoMorph uses contrastive learning across different texts and CT scans, significantly reducing dependence on scarce image-report pairs (only 141 pairs used in this study) by leveraging a large corpus of 34,035 radiology reports. Finally, we have developed rigorous tests to evaluate synthetic tumors, including Text-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our synthetic tumors are realistic and diverse in texture, heterogeneity, boundaries, and pathology.
Submitted 24 December, 2024;
originally announced December 2024.
-
PowerRadio: Manipulate Sensor Measurement via Power GND Radiation
Authors:
Yan Jiang,
Xiaoyu Ji,
Yancheng Jiang,
Kai Wang,
Chenren Xu,
Wenyuan Xu
Abstract:
Sensors are key components enabling various applications, e.g., home intrusion detection and environmental monitoring. While various software defenses and physical protections are used to prevent sensor manipulation, this paper introduces a new threat vector, PowerRadio, that bypasses existing protections and changes sensor readings from a distance. PowerRadio leverages interconnected ground (GND) wires, a standard practice for electrical safety at home, to inject malicious signals. The injected signal is coupled by the sensor's analog measurement wire and eventually survives the noise filters, inducing incorrect measurement. We present three methods to manipulate sensors by inducing static bias, periodical signals, or pulses. For instance, we show adding stripes into the captured images of a surveillance camera or injecting inaudible voice commands into conference microphones. We study the underlying principles of PowerRadio and identify its root causes: (1) the lack of shielding between ground and data signal wires and (2) the asymmetry of circuit impedance that enables interference to bypass filtering. We validate PowerRadio against a surveillance system, broadcast systems, and various sensors. We believe that PowerRadio represents an emerging threat, exhibiting the advantages of both radiated and conducted EMI, e.g., expanding the effective attack distance of radiated EMI yet eliminating the requirement of line-of-sight or approaching physically. Our insights shall provide guidance for enhancing the sensors' security and power wiring during the design phases.
Submitted 23 December, 2024;
originally announced December 2024.
-
CALLIC: Content Adaptive Learning for Lossless Image Compression
Authors:
Daxin Li,
Yuanchao Bai,
Kai Wang,
Junjun Jiang,
Xianming Liu,
Wen Gao
Abstract:
Learned lossless image compression has achieved significant advancements in recent years. However, existing methods often rely on training amortized generative models on massive datasets, resulting in sub-optimal probability distribution estimation for specific test images during the encoding process. To address this challenge, we explore the connection between the Minimum Description Length (MDL) principle and Parameter-Efficient Transfer Learning (PETL), leading to the development of a novel content-adaptive approach for learned lossless image compression, dubbed CALLIC. Specifically, we first propose a content-aware autoregressive self-attention mechanism by leveraging convolutional gating operations, termed Masked Gated ConvFormer (MGCF), and pretrain MGCF on the training dataset. Cache then Crop Inference (CCI) is proposed to accelerate the coding process. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights on the test image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with gradually increasing patches that are sorted in descending order by estimated entropy, optimizing the learning process and reducing adaptation time. Extensive experiments across diverse datasets demonstrate that CALLIC sets a new state-of-the-art (SOTA) for learned lossless image compression.
Submitted 23 December, 2024;
originally announced December 2024.
-
Text2midi: Generating Symbolic Music from Captions
Authors:
Keshav Bhandari,
Abhinaba Roy,
Kyra Wang,
Geeta Puri,
Simon Colton,
Dorien Herremans
Abstract:
This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Specifically, we utilize a pretrained LLM encoder to process captions, which then condition an autoregressive transformer decoder to produce MIDI sequences that accurately reflect the provided descriptions. This intuitive and user-friendly method significantly streamlines the music creation process by allowing users to generate music pieces using text prompts. We conduct comprehensive empirical evaluations, incorporating both automated and human studies, which show that our model generates high-quality MIDI files that are controllable by text captions, including captions that use music theory terms such as chords, keys, and tempo. We release the code and music samples on our demo page (https://github.com/AMAAI-Lab/Text2midi) for users to interact with text2midi.
Submitted 31 December, 2024; v1 submitted 21 December, 2024;
originally announced December 2024.