-
Probabilistic approximation of fully nonlinear second-order PIDEs with convergence rates for the universal robust limit theorem
Authors:
Lianzi Jiang,
Mingshang Hu,
Gechun Liang
Abstract:
This paper develops a probabilistic approximation scheme for a class of nonstandard, fully nonlinear second-order partial integro-differential equations (PIDEs) arising from nonlinear Lévy processes under Peng's G-expectation framework. The PIDE involves a supremum over a set of \(α\)-stable Lévy measures, potentially with degenerate diffusion and a non-separable uncertainty set, which renders existing numerical results inapplicable. We construct a recursive, piecewise-constant approximation to the viscosity solution and derive explicit error bounds. A key application of our analysis is the quantification of convergence rates for the universal robust limit theorem under sublinear expectations, unifying Peng's robust central limit theorem, laws of large numbers, and the \(α\)-stable limit theorem of Bayraktar and Munk, with explicit Berry--Esseen-type bounds.
Submitted 23 June, 2025;
originally announced June 2025.
-
Rapeseed population point cloud completion network (RP-PCN) with dynamic graph convolution for 3D reconstruction of crop canopy occlusion architecture
Authors:
Ziyue Guo,
Xin Yang,
Yutao Shen,
Yang Zhu,
Lixi Jiang,
Haiyan Cen
Abstract:
Quantitative descriptions of complete canopy architecture are crucial for evaluating crop photosynthesis and yield to guide ideotype design. Although three-dimensional (3D) sensing technologies have been developed for plant and canopy reconstruction, severe occlusion and complex architectures hinder accurate canopy descriptions. In this study, we propose a point cloud completion model for 3D reconstruction of rapeseed populations from seeding to silique stages using multi-view imaging. A complete point cloud generation framework was developed with the virtual-real integration (VRI) simulation method and occlusion point detection algorithm to annotate the training dataset by distinguishing surface from occluded points. The rapeseed population point cloud completion network (RP-PCN) was designed with a multi-resolution dynamic graph convolutional encoder (MRDG) and point pyramid decoder (PPD) to predict occluded points based on input surface point clouds. A dynamic graph convolutional feature extractor (DGCFE) was introduced to capture structural variations across the growth period. The effectiveness of point cloud completion was validated by predicting yield using architectural indicators from complete point clouds of rapeseed population. The results demonstrated that RP-PCN achieved chamfer distance (CD) values of 3.35 cm, 3.46 cm, 4.32 cm, and 4.51 cm at the seedling, bolting, flowering, and silique stages, respectively. Ablation studies showed the effectiveness of the MRDG and DGCFE modules, reducing CD values by 10% and 23%, respectively. The silique efficiency index (SEI) from RP-PCN improved yield prediction accuracy by 11.2% compared to incomplete point clouds. The RP-PCN pipeline proposed in this study has the potential to be extended to other crops, significantly enhancing the analysis of population canopy architectures in field environments.
Submitted 23 June, 2025;
originally announced June 2025.
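
The chamfer distance (CD) reported in the abstract above measures how closely a predicted point cloud matches a reference one. The following is a minimal sketch of one common symmetric, unsquared variant; the abstract does not state which variant the authors use, and the function name and toy coordinates are hypothetical.

```python
import math


def chamfer_distance(a, b):
    """Symmetric chamfer distance between two point sets (lists of xyz tuples).

    Assumption: unsquared Euclidean distances, averaged over both directions.
    Other definitions (squared, summed) exist and give different values.
    """
    def mean_nearest(src, dst):
        # For each point in src, distance to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(a, b) + mean_nearest(b, a))


# Toy, made-up point clouds (not data from the paper).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(predicted, reference)
print(cd)
```

This brute-force version is quadratic in the number of points; for real canopy point clouds a spatial index such as a k-d tree would normally replace the inner loop.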
-
ThermalLoc: A Vision Transformer-Based Approach for Robust Thermal Camera Relocalization in Large-Scale Environments
Authors:
Yu Liu,
Yangtao Meng,
Xianfei Pan,
Jie Jiang,
Changhao Chen
Abstract:
Thermal cameras capture environmental data through heat emission, a fundamentally different mechanism compared to visible light cameras, which rely on pinhole imaging. As a result, traditional visual relocalization methods designed for visible light images are not directly applicable to thermal images. Despite significant advancements in deep learning for camera relocalization, approaches specifically tailored for thermal camera-based relocalization remain underexplored. To address this gap, we introduce ThermalLoc, a novel end-to-end deep learning method for thermal image relocalization. ThermalLoc effectively extracts both local and global features from thermal images by integrating EfficientNet with Transformers, and performs absolute pose regression using two MLP networks. We evaluated ThermalLoc on both the publicly available thermal-odometry dataset and our own dataset. The results demonstrate that ThermalLoc outperforms existing representative methods employed for thermal camera relocalization, including AtLoc, MapNet, PoseNet, and RobustLoc, achieving superior accuracy and robustness.
Submitted 22 June, 2025;
originally announced June 2025.
-
Robot Tactile Gesture Recognition Based on Full-body Modular E-skin
Authors:
Shuo Jiang,
Boce Hu,
Linfeng Zhao,
Lawson L. S. Wong
Abstract:
With the development of robot electronic skin technology, various tactile sensors, enhanced by AI, are unlocking a new dimension of perception for robots. In this work, we explore how robots equipped with electronic skin can recognize tactile gestures and interpret them as human commands. We developed a modular robot E-skin, composed of multiple irregularly shaped skin patches, which can be assembled to cover the robot's body while capturing real-time pressure and pose data from thousands of sensing points. To process this information, we propose an equivariant graph neural network-based recognizer that efficiently and accurately classifies diverse tactile gestures, including poke, grab, stroke, and double-pat. By mapping the recognized gestures to predefined robot actions, we enable intuitive human-robot interaction purely through tactile input.
Submitted 22 June, 2025;
originally announced June 2025.
-
Referring Expression Instance Retrieval and A Strong End-to-End Baseline
Authors:
Xiangzhao Hao,
Kuan Zhu,
Hongyu Guo,
Haiyun Guo,
Ning Jiang,
Quan Lu,
Ming Tang,
Jinqiao Wang
Abstract:
Using natural language to query visual information is a fundamental need in real-world applications. Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically query a large gallery with an instance-level description and expect to receive both the relevant image and the corresponding instance location. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC is limited in its ability to efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called \textbf{Referring Expression Instance Retrieval (REIR)}, which supports both instance-level retrieval and localization based on fine-grained referring expressions. First, we propose a large-scale benchmark for REIR, named REIRCOCO, constructed by prompting advanced vision-language models to generate high-quality referring expressions for instances in the MSCOCO and RefCOCO datasets. Second, we present a baseline method, Contrastive Language-Instance Alignment with Relation Experts (CLARE), which employs a dual-stream architecture to address REIR in an end-to-end manner. Given a referring expression, the textual branch encodes it into a query embedding. The visual branch detects candidate objects and extracts their instance-level visual features. The candidate most similar to the query is selected for bounding box prediction. CLARE is first trained on object detection and REC datasets to establish initial grounding capabilities, then optimized via Contrastive Language-Instance Alignment (CLIA) for improved retrieval across images. We will release our code and benchmark publicly.
Submitted 26 June, 2025; v1 submitted 22 June, 2025;
originally announced June 2025.
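
The retrieval step described in the abstract above, selecting the candidate instance most similar to the query embedding, can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure and uses hypothetical names (`cosine`, `retrieve`) and made-up two-dimensional features; the abstract does not specify these details, so this is not the CLARE implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the (image_id, box, feature) candidate closest to the query.

    Assumption: candidates from every gallery image are ranked together,
    so the result identifies both the image and the instance box.
    """
    return max(candidates, key=lambda c: cosine(query, c[2]))


# Toy, made-up embeddings standing in for text and instance features.
query = [1.0, 0.0]
candidates = [
    ("img_a", (10, 10, 50, 50), [0.0, 1.0]),
    ("img_b", (5, 20, 40, 80), [0.9, 0.1]),
    ("img_b", (0, 0, 30, 30), [-1.0, 0.0]),
]
best = retrieve(query, candidates)
print(best[0], best[1])
```

Ranking instances rather than whole images is what lets a single lookup return the image and the location at once.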
-
CT Radiomics-Based Explainable Machine Learning Model for Accurate Differentiation of Malignant and Benign Endometrial Tumors: A Two-Center Study
Authors:
Tingrui Zhang,
Honglin Wu,
Zekun Jiang,
Yingying Wang,
Rui Ye,
Huiming Ni,
Chang Liu,
Jin Cao,
Xuan Sun,
Rong Shao,
Xiaorong Wei,
Yingchun Sun
Abstract:
This study aimed to develop and validate a CT radiomics-based explainable machine learning model for diagnosing malignancy and benignity specifically in endometrial cancer (EC) patients. A total of 83 EC patients from two centers, including 46 with malignant and 37 with benign conditions, were included, with data split into a training set (n=59) and a testing set (n=24). The regions of interest (ROIs) were manually segmented from pre-surgical CT scans, and 1132 radiomic features were extracted from these scans using Pyradiomics. Six explainable machine learning algorithms were implemented to determine the optimal radiomics pipeline. The diagnostic performance of the radiomic model was evaluated using sensitivity, specificity, accuracy, precision, F1 score, confusion matrices, and ROC curves. To enhance clinical understanding and usability, we separately implemented SHAP analysis and feature mapping visualization, and evaluated the calibration curve and decision curve. Among the six modeling strategies compared, the Random Forest model emerged as the optimal choice for diagnosing EC, with a training AUC of 1.00 and a testing AUC of 0.96. SHAP identified the most important radiomic features, revealing that all selected features were significantly associated with EC (P < 0.05). Radiomics feature maps also provide a feasible assessment tool for clinical applications. Decision curve analysis (DCA) indicated a higher net benefit for our model compared to the "All" and "None" strategies, suggesting its clinical utility in identifying high-risk cases and reducing unnecessary interventions. In conclusion, the CT radiomics-based explainable machine learning model achieved high diagnostic performance and could be used as an intelligent auxiliary tool for the diagnosis of endometrial cancer.
Submitted 22 June, 2025;
originally announced June 2025.
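
The decision curve comparison mentioned in the abstract above rests on the net benefit at a chosen threshold probability. A minimal sketch follows, using the standard definition (true positives per patient minus false positives per patient weighted by the threshold odds). The function name, labels, and probabilities are hypothetical toy values, not data from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every case with predicted risk >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    n = len(y_true)
    flagged = [(y, p) for y, p in zip(y_true, y_prob) if p >= threshold]
    tp = sum(1 for y, _ in flagged if y == 1)
    fp = sum(1 for y, _ in flagged if y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy, made-up labels (1 = malignant) and predicted probabilities.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.1]
threshold = 0.5

model_nb = net_benefit(y_true, y_prob, threshold)
# "All" strategy: every patient is flagged, i.e. risk 1.0 for everyone.
all_nb = net_benefit(y_true, [1.0] * len(y_true), threshold)
# "None" strategy: nobody is flagged, so the net benefit is zero.
none_nb = 0.0
print(model_nb, all_nb, none_nb)
```

A decision curve repeats this calculation over a range of thresholds; a model is considered useful where its curve lies above both reference strategies.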
-
InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating
Authors:
Fuyu Wang,
Jiangtong Li,
Kun Zhu,
Changjun Jiang
Abstract:
With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions$-$including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement$-$thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) $\textbf{InspireScore}$, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) $\textbf{InspireDebate}$, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that $\textbf{InspireScore}$ achieves 44$\%$ higher correlation with expert judgments compared to existing methods, while $\textbf{InspireDebate}$ shows significant improvements, outperforming baseline models by 57$\%$. Source code is available at https://github.com/fywang12/InspireDebate.
Submitted 22 June, 2025;
originally announced June 2025.
-
Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster
Authors:
Fenghe Tang,
Wenxin Ma,
Zhiyang He,
Xiaodong Tao,
Zihang Jiang,
S. Kevin Zhou
Abstract:
With the advancement of Large Language Models (LLMs) for natural language processing, this paper presents an intriguing finding: a frozen pre-trained LLM layer can process visual tokens for medical image segmentation tasks. Specifically, we propose a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). Surprisingly, this design improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, polypscopy, and CT scans. Our in-depth analysis reveals the potential of transferring an LLM's semantic awareness to enhance segmentation tasks, offering both improved global understanding and better local modeling capabilities. The improvement proves robust across different LLMs, validated using LLaMA and DeepSeek.
Submitted 22 June, 2025;
originally announced June 2025.
-
LVPNet: A Latent-variable-based Prediction-driven End-to-end Framework for Lossless Compression of Medical Images
Authors:
Chenyue Song,
Chen Hui,
Qing Lin,
Wei Zhang,
Siqiao Li,
Haiqi Zhu,
Zhixuan Li,
Shengping Zhang,
Shaohui Liu,
Feng Jiang,
Xiang Li
Abstract:
Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of latent variables. To deal with these issues, we propose a prediction-based end-to-end lossless medical image compression method named LVPNet, leveraging global latent variables to predict pixel values and encoding predicted probabilities for lossless compression. Specifically, we introduce the Global Multi-scale Sensing Module (GMSM), which extracts compact and informative latent representations from the entire image, effectively capturing spatial dependencies within the latent space. Furthermore, to mitigate the information loss introduced during quantization, we propose the Quantization Compensation Module (QCM), which learns the distribution of quantization errors and refines the quantized features to compensate for quantization loss. Extensive experiments on challenging benchmarks demonstrate that our method achieves superior compression efficiency compared to state-of-the-art lossless image compression approaches, while maintaining competitive inference speed. The code is at https://github.com/scy-Jackel/LVPNet.
Submitted 25 June, 2025; v1 submitted 22 June, 2025;
originally announced June 2025.
-
BPCLIP: A Bottom-up Image Quality Assessment from Distortion to Semantics Based on CLIP
Authors:
Chenyue Song,
Chen Hui,
Wei Zhang,
Haiqi Zhu,
Shaohui Liu,
Hong Huang,
Feng Jiang
Abstract:
Image Quality Assessment (IQA) aims to evaluate the perceptual quality of images based on human subjective perception. Existing methods generally combine multiscale features to achieve high performance, but most rely on straightforward linear fusion of these features, which may not adequately capture the impact of distortions on semantic content. To address this, we propose a bottom-up image quality assessment approach based on the Contrastive Language-Image Pre-training (CLIP, a recently proposed model that aligns images and text in a shared feature space), named BPCLIP, which progressively extracts the impact of low-level distortions on high-level semantics. Specifically, we utilize an encoder to extract multiscale features from the input image and introduce a bottom-up multiscale cross attention module designed to capture the relationships between shallow and deep features. In addition, by incorporating 40 image quality adjectives across six distinct dimensions, we enable the pre-trained CLIP text encoder to generate representations of the intrinsic quality of the image, thereby strengthening the connection between image quality perception and human language. Our method achieves superior results on most public Full-Reference (FR) and No-Reference (NR) IQA benchmarks, while demonstrating greater robustness.
Submitted 22 June, 2025;
originally announced June 2025.
-
OmniESI: A unified framework for enzyme-substrate interaction prediction with progressive conditional deep learning
Authors:
Zhiwei Nie,
Hongyu Zhang,
Hao Jiang,
Yutian Liu,
Xiansong Huang,
Fan Xu,
Jie Fu,
Zhixiang Ren,
Yonghong Tian,
Wen-Bin Zhang,
Jie Chen
Abstract:
Understanding and modeling enzyme-substrate interactions is crucial for catalytic mechanism research, enzyme engineering, and metabolic engineering. Although a large number of predictive methods have emerged, they do not incorporate prior knowledge of enzyme catalysis to rationally modulate general protein-molecule features that are misaligned with catalytic patterns. To address this issue, we introduce a two-stage progressive framework, OmniESI, for enzyme-substrate interaction prediction through conditional deep learning. By decomposing the modeling of enzyme-substrate interactions into a two-stage progressive process, OmniESI incorporates two conditional networks that respectively emphasize enzymatic reaction specificity and crucial catalysis-related interactions, facilitating a gradual feature modulation in the latent space from the general protein-molecule domain to the catalysis-aware domain. On top of this unified architecture, OmniESI can adapt to a variety of downstream tasks, including enzyme kinetic parameter prediction, enzyme-substrate pairing prediction, enzyme mutational effect prediction, and enzymatic active site annotation. Under a multi-perspective performance evaluation covering in-distribution and out-of-distribution settings, OmniESI consistently outperformed state-of-the-art specialized methods across seven benchmarks. More importantly, the proposed conditional networks were shown to internalize the fundamental patterns of catalytic efficiency while significantly improving prediction performance, with only negligible parameter increases (0.16%), as demonstrated by ablation studies on key components. Overall, OmniESI represents a unified predictive approach for enzyme-substrate interactions, providing an effective tool for catalytic mechanism cracking and enzyme engineering with strong generalization and broad applicability.
Submitted 22 June, 2025;
originally announced June 2025.
-
Near-Field Propagation and Spatial Non-Stationarity Channel Model for 6-24 GHz (FR3) Extremely Large-Scale MIMO: Adopted by 3GPP for 6G
Authors:
Huixin Xu,
Jianhua Zhang,
Pan Tang,
Hongbo Xing,
Haiyang Miao,
Nan Zhang,
Jian Li,
Jianming Wu,
Wenfei Yang,
Zhening Zhang,
Wei Jiang,
Zijian He,
Afshin Haghighat,
Qixing Wang,
Guangyi Liu
Abstract:
Next generation cellular deployments are expected to exploit the 6-24 GHz frequency range 3 (FR3) and extremely large-scale multiple-input multiple-output (XL-MIMO) to enable ultra-high data rates and reliability. However, the significantly enlarged antenna apertures and higher carrier frequencies render the far-field and spatial stationarity assumptions in the existing 3rd generation partnership project (3GPP) channel models invalid, giving rise to new features such as near-field propagation and spatial non-stationarity (SNS). Despite extensive prior research, incorporating these new features within the standardized channel modeling framework remains an open issue. To address this, this paper presents a channel modeling framework for XL-MIMO systems that incorporates both near-field and SNS features, adopted by 3GPP. For the near-field propagation feature, the framework models the distances from the base station (BS) and user equipment to the spherical-wave sources associated with clusters. These distances are used to characterize element-wise variations of path parameters, such as nonlinear changes in phase and angle. To capture the effect of SNS at the BS side, a stochastic-based approach is proposed to model SNS caused by incomplete scattering, by establishing power attenuation factors from visibility probability and visibility region to characterize antenna element-wise path power variation. In addition, a physical blocker-based approach is introduced to model SNS effects caused by partial blockage. Finally, a simulation framework for near-field and SNS is developed within the structure of the existing 3GPP channel model. Performance evaluations demonstrate that the near-field model captures higher channel capacity potential compared to the far-field model. Coupling loss results indicate that SNS leads to more pronounced propagation fading relative to the spatial stationary model.
Submitted 21 June, 2025;
originally announced June 2025.
-
Incorporating Rather Than Eliminating: Achieving Fairness for Skin Disease Diagnosis Through Group-Specific Expert
Authors:
Gelei Xu,
Yuying Duan,
Zheyuan Liu,
Xueyang Li,
Meng Jiang,
Michael Lemmon,
Wei Jin,
Yiyu Shi
Abstract:
AI-based systems have achieved high accuracy in skin disease diagnostics but often exhibit biases across demographic groups, leading to inequitable healthcare outcomes and diminished patient trust. Most existing bias mitigation methods attempt to eliminate the correlation between sensitive attributes and diagnostic prediction, but those methods often degrade performance due to the loss of clinically relevant diagnostic cues. In this work, we propose an alternative approach that incorporates sensitive attributes to achieve fairness. We introduce FairMoE, a framework that employs layer-wise mixture-of-experts modules to serve as group-specific learners. Unlike traditional methods that rigidly assign data based on group labels, FairMoE dynamically routes data to the most suitable expert, making it particularly effective for handling cases near group boundaries. Experimental results show that, unlike previous fairness approaches that reduce performance, FairMoE achieves substantial accuracy improvements while preserving comparable fairness metrics.
Submitted 21 June, 2025;
originally announced June 2025.
-
Room-temperature intrinsic nonlinear planar Hall effect in TaIrTe$_4$
Authors:
Chang Jiang,
Fan Yang,
Jinshan Yang,
Peng Yu,
Huiying Liu,
Yuda Zhang,
Zehao Jia,
Xiangyu Cao,
Jingyi Yan,
Zheng Liu,
Xian-Lei Sheng,
Cong Xiao,
Shengyuan A. Yang,
Shaoming Dong,
Faxian Xiu
Abstract:
Intrinsic responses are of paramount importance in physics research, as they represent the inherent properties of materials, independent of extrinsic factors that vary from sample to sample, and often reveal the intriguing quantum geometry of the band structure. Here, we report the experimental discovery of a new intrinsic response in charge transport, specifically the intrinsic nonlinear planar Hall effect (NPHE), in the topological semimetal TaIrTe$_4$. This effect is characterized by an induced Hall current that is quadratic in the driving electric field and linear in the in-plane magnetic field. The response coefficient is determined by the susceptibility tensor of Berry-connection polarizability dipole, which is an intrinsic band geometric quantity. Remarkably, the signal persists up to room temperature. Our theoretical calculations show excellent agreement with the experimental results and further elucidate the significance of a previously unknown orbital mechanism in intrinsic NPHE. This finding not only establishes a novel intrinsic material property but also opens a new route toward innovative nonlinear devices capable of operating at room temperature.
Submitted 21 June, 2025;
originally announced June 2025.
-
New Determination of the $^{14}$C(n, $γ$)$^{15}$C Reaction Rate and Its Astrophysical Implications
Authors:
Yuchen Jiang,
Zhenyu He,
Yudong Luo,
Wenyu Xin,
Jie Chen,
Xinyue Li,
Yangping Shen,
Bing Guo,
Guo Li,
Danyang Pang,
Tianli Ma,
Weike Nan,
Toshitaka Kajino,
Weiping Liu
Abstract:
We present a novel experiment to investigate the spectroscopic factor of the $^{15}$C ground state for the first time using single-neutron $removal$ transfer reactions on $^{15}$C. Two consistent spectroscopic factors were derived from the (p, d) and (d, t) reactions, which were subsequently used to deduce the $^{14}$C(n, $γ$)$^{15}$C reaction cross section and the corresponding stellar reaction rate. A typical cross section of (3.89 $\pm$ 0.76) $μ$b is determined at $E_\mathrm{_{c.m.}}$ = 23.3 keV. In the temperature range of 0.01-4 GK, our new reaction rate is 2.4-3.7 times higher than that of the first direct measurement and 20\%-25\% lower than that of the most recent direct measurement. Moreover, it is interesting that we can associate a long-standing nuclear structure issue, i.e., the so-called ``quenching'' effect, with this astrophysically relevant reaction. Finally, motivated by the astrophysical interest in this reaction dating back decades, the implications of our new rate for several astrophysical problems are evaluated using state-of-the-art theoretical models. Our calculations demonstrate that the abundances of $^{14}$N and $^{15}$N can be enhanced in the inner regions of asymptotic giant branch (AGB) stars, though with minimal impact on the chemical composition of the interstellar medium. In inhomogeneous Big Bang nucleosynthesis, the updated reaction rate can lead to a $\sim 20\%$ variation in the final yields of $^{15}$N in neutron-rich regions. For the $r$-process in core-collapse supernovae, a slight difference of $\sim 0.2\%$ in the final abundances of heavy elements with $A > 90$ can be found by using our new rate.
Submitted 21 June, 2025;
originally announced June 2025.
-
Improving Compiler Bug Isolation by Leveraging Large Language Models
Authors:
Yixian Qi,
Jiajun Jiang,
Fengjie Li,
Bowen Chen,
Hongyu Zhang,
Junjie Chen
Abstract:
Compilers play a foundational role in building reliable software systems, and bugs within them can lead to catastrophic consequences. The compilation process typically involves hundreds of files, making traditional automated bug isolation techniques inapplicable due to scalability or effectiveness issues. Current mainstream compiler bug localization techniques have limitations in test program mutation and resource consumption. Inspired by recent advances in pre-trained Large Language Models (LLMs), we propose an innovative approach named AutoCBI, which (1) uses LLMs to summarize compiler file functions and (2) employs specialized prompts to guide the LLM in reordering suspicious file rankings. This approach leverages four types of information: the failing test program, source file function summaries, lists of suspicious files identified through analyzing test coverage, and compilation configurations with related output messages, resulting in a refined ranking of suspicious files. Our evaluation of AutoCBI against state-of-the-art approaches (DiWi, RecBi and FuseFL) on 120 real-world bugs from the widely used GCC and LLVM compilers demonstrates its effectiveness. Specifically, AutoCBI isolates 66.67%/69.23%, 300%/340%, and 100%/57.14% more bugs than RecBi, DiWi, and FuseFL, respectively, in the Top-1 ranked results for GCC/LLVM. Additionally, the ablation study underscores the significance of each component in our approach.
Submitted 21 June, 2025;
originally announced June 2025.
-
Measuring and Augmenting Large Language Models for Solving Capture-the-Flag Challenges
Authors:
Zimo Ji,
Daoyuan Wu,
Wenyuan Jiang,
Pingchuan Ma,
Zongjie Li,
Shuai Wang
Abstract:
Capture-the-Flag (CTF) competitions are crucial for cybersecurity education and training. As large language models (LLMs) evolve, there is increasing interest in their ability to automate CTF challenge solving. For example, DARPA has organized the AIxCC competition since 2023 to advance AI-powered automated offense and defense. However, this demands a combination of multiple abilities, from knowledge to reasoning and further to actions. In this paper, we highlight the importance of technical knowledge in solving CTF problems and deliberately construct a focused benchmark, CTFKnow, with 3,992 questions to measure LLMs' performance in this core aspect. Our study offers a focused and innovative measurement of LLMs' capability in understanding CTF knowledge and applying it to solve CTF challenges. Our key findings reveal that while LLMs possess substantial technical knowledge, they falter in accurately applying this knowledge to specific scenarios and adapting their strategies based on feedback from the CTF environment.
Based on insights derived from this measurement study, we propose CTFAgent, a novel LLM-driven framework for advancing CTF problem-solving. CTFAgent introduces two new modules: two-stage Retrieval Augmented Generation (RAG) and interactive Environmental Augmentation, which enhance LLMs' technical knowledge and vulnerability exploitation on CTF, respectively. Our experimental results show that CTFAgent achieves over 80% performance improvement on each of two popular CTF datasets. Moreover, in the recent picoCTF2024 hosted by CMU, CTFAgent ranked in the top 23.6% of nearly 7,000 participating teams. This reflects the benefit of our measurement study and the potential of our framework in advancing LLMs' capabilities in CTF problem-solving.
Submitted 21 June, 2025;
originally announced June 2025.
-
Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
Authors:
Bowen Wang,
Zhouqiang Jiang,
Yasuaki Susumu,
Shotaro Miwa,
Tianwei Chen,
Yuta Nakashima
Abstract:
The real value of knowledge lies not just in its accumulation, but in its potential to be harnessed effectively to conquer the unknown. Although recent multimodal large language models (MLLMs) exhibit impressive multimodal capabilities, they often fail in rarely encountered domain-specific tasks due to limited relevant knowledge. To explore this, we adopt visual game cognition as a testbed and select Monster Hunter: World as the target to construct a multimodal knowledge graph (MH-MMKG), which incorporates multi-modalities and intricate entity relations. We also design a series of challenging queries based on MH-MMKG to evaluate the models' ability for complex knowledge retrieval and reasoning. Furthermore, we propose a multi-agent retriever that enables a model to autonomously search relevant knowledge without additional training. Experimental results show that our approach significantly enhances the performance of MLLMs, providing a new perspective on multimodal knowledge-augmented reasoning and laying a solid foundation for future research.
Submitted 25 June, 2025; v1 submitted 21 June, 2025;
originally announced June 2025.
-
VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
Authors:
Chongkai Gao,
Zixuan Liu,
Zhenghao Chi,
Junshan Huang,
Xin Fei,
Yiwen Hou,
Yuxuan Zhang,
Yudi Lin,
Zhirui Fang,
Zeyu Jiang,
Lin Shao
Abstract:
Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in terms of network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and the components to be further improved. To systematically investigate the impacts of different planning paradigms and representations in isolation from network architectures and training data, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally matches or outperforms other paradigms in task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.
Submitted 20 June, 2025;
originally announced June 2025.
-
Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations
Authors:
Zhihao Yuan,
Shuyi Jiang,
Chun-Mei Feng,
Yaolun Zhang,
Shuguang Cui,
Zhen Li,
Na Zhao
Abstract:
Currently, utilizing large language models to understand the 3D world is becoming popular. Yet existing 3D-aware LLMs act as black boxes: they output bounding boxes or textual answers without revealing how those decisions are made, and they still rely on pre-trained 3D detectors to supply object proposals. We introduce Scene-R1, a video-grounded framework that learns to reason about 3D scenes without any point-wise 3D instance supervision by pairing reinforcement-learning-driven reasoning with a two-stage grounding pipeline. In the temporal grounding stage, we explicitly reason about the video and select the video snippets most relevant to an open-ended query. In the subsequent image grounding stage, we analyze the image and predict the 2D bounding box. After that, we track the object using SAM2 to produce pixel-accurate masks in RGB frames, and project them back into 3D, thereby eliminating the need for 3D detector-based proposals while capturing fine geometry and material cues. Scene-R1 can also adapt to the 3D visual question answering task to answer free-form questions directly from video. Our training pipeline only needs task-level 2D boxes or textual labels without dense 3D point-wise labels. Scene-R1 surpasses existing open-vocabulary baselines on multiple datasets, while delivering transparent, step-by-step rationales. These results show that reinforcement-learning-based reasoning combined with RGB-D video alone offers a practical, annotation-efficient route to trustworthy 3D scene understanding.
Submitted 20 June, 2025;
originally announced June 2025.
-
Exploring Strategies for Personalized Radiation Therapy Part I Unlocking Response-Related Tumor Subregions with Class Activation Mapping
Authors:
Hao Peng,
Steve Jiang,
Robert Timmerman
Abstract:
Personalized precision radiation therapy requires more than simple classification; it demands the identification of prognostic, spatially informative features and the ability to adapt treatment based on individual response. This study compares three approaches for predicting treatment response: standard radiomics, gradient-based features, and convolutional neural networks enhanced with Class Activation Mapping (CAM). We analyzed 69 brain metastases from 39 patients treated with Gamma Knife radiosurgery. An integrated autoencoder-classifier model was used to predict whether tumor volume would shrink by more than 20 percent at a three-month follow-up, framed as a binary classification task. The results highlight their strength in hierarchical feature extraction and the classifier's discriminative capacity. Among the models, pixel-wise CAM provides the most detailed spatial insight, identifying lesion-specific regions rather than relying on fixed patterns, demonstrating strong generalization. In non-responding lesions, the activated regions may indicate areas of radioresistance. Pixel-wise CAM outperformed both radiomics and gradient-based methods in classification accuracy. Moreover, its fine-grained spatial features allow for alignment with cellular-level data, supporting biological validation and deeper understanding of heterogeneous treatment responses. Although further validation is necessary, these findings underscore the promise of this approach in guiding personalized and adaptive radiotherapy strategies for both photon and particle therapies.
Submitted 20 June, 2025;
originally announced June 2025.
-
Exploring Strategies for Personalized Radiation Therapy Part II Predicting Tumor Drift Patterns with Diffusion Models
Authors:
Hao Peng,
Steve Jiang,
Robert Timmerman
Abstract:
Radiation therapy outcomes are decided by two key parameters, dose and timing, whose optimal values vary substantially across patients. This variability is especially critical in the treatment of brain cancer, where fractionated or staged stereotactic radiosurgery improves safety compared to single-fraction approaches, but complicates the ability to predict treatment response. To address this challenge, we employ Personalized Ultra-fractionated Stereotactic Adaptive Radiotherapy (PULSAR), a strategy that dynamically adjusts treatment based on how each tumor evolves over time. However, the success of PULSAR and other adaptive approaches depends on predictive tools that can guide early treatment decisions and avoid both overtreatment and undertreatment. Yet current radiomics and dosiomics models offer limited insight into the evolving spatial and temporal patterns of tumor response. To overcome these limitations, we propose a novel framework using Denoising Diffusion Implicit Models (DDIM), which learns data-driven mappings from pre- to post-treatment imaging. In this study, we developed single-step and iterative denoising strategies and compared their performance. The results show that diffusion models can effectively simulate patient-specific tumor evolution and localize regions associated with treatment response. The proposed strategy provides a promising foundation for modeling heterogeneous treatment response and enabling early, adaptive interventions, paving the way toward more personalized and biologically informed radiotherapy.
Submitted 20 June, 2025;
originally announced June 2025.
-
Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
Authors:
Mingyuan Wu,
Meitang Li,
Jingcheng Yang,
Jize Jiang,
Kaizhuo Yan,
Zhaoheng Li,
Minjia Zhang,
Klara Nahrstedt
Abstract:
Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that while decoding strategies such as majority voting and best-of-N selection with self-verification all improve VLM reasoning performance, generation-reliant methods such as the former achieve significantly higher gains than verification-reliant methods such as the latter. Additionally, the self-correction behavior often associated with RL-tuned models, such as the "aha moment", does not lead to measurable gains. Through extensive experimentation within the inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
Submitted 20 June, 2025;
originally announced June 2025.
-
GTA: Grouped-head latenT Attention
Authors:
Luoyang Sun,
Jiwen Jiang,
Cheng Deng,
Xinjian Wu,
Haifeng Zhang,
Lei Chen,
Lionel Ni,
Jun Wang
Abstract:
Attention mechanisms underpin the success of large language models (LLMs), yet their substantial computational and memory overhead poses challenges for optimizing efficiency and performance. A critical bottleneck arises as KV cache and attention computations scale rapidly with text length, challenging deployment on hardware with limited computational and memory resources. We observe that attention mechanisms exhibit substantial redundancy, since the KV cache can be significantly compressed and attention maps across heads display high similarity, revealing that much of the computation and storage is unnecessary. Leveraging these insights, we propose Grouped-Head LatenT Attention (GTA), a novel attention mechanism that reduces memory usage and computational complexity while maintaining performance. GTA comprises two components: (1) a shared attention map mechanism that reuses attention scores across multiple heads, decreasing the key cache size; and (2) a nonlinear value decoder with learned projections that compresses the value cache into a latent space, further cutting memory needs. GTA cuts attention computation FLOPs by up to 62.5% versus Grouped-Query Attention and shrinks the KV cache by up to 70%, all while avoiding the extra overhead of Multi-Head Latent Attention to improve LLM deployment efficiency. Consequently, GTA models achieve a 2x increase in end-to-end inference speed, with prefill benefiting from reduced computational cost and decoding benefiting from the smaller cache footprint.
Submitted 15 June, 2025;
originally announced June 2025.
-
QUST_NLP at SemEval-2025 Task 7: A Three-Stage Retrieval Framework for Monolingual and Crosslingual Fact-Checked Claim Retrieval
Authors:
Youzheng Liu,
Jiyan Liu,
Xiaoman Xu,
Taihang Wang,
Yimin Wang,
Ye Jiang
Abstract:
This paper describes the participation of QUST_NLP in the SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: https://github.com/warmth27/SemEval2025_Task7
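The final weighted-voting stage described above can be sketched as follows. This is only an illustrative sketch, not the team's released code: the function name `weighted_vote`, the re-ranker names, the weights, and the rank-based scoring rule (a candidate at rank r in a Top-k list earns k - r points) are all assumptions made for the example.

```python
from collections import defaultdict


def weighted_vote(rankings, weights, top_k=10):
    """Fuse several Top-k candidate lists into one ranking by weighted voting.

    rankings: dict mapping a (hypothetical) re-ranker name to its ranked candidate IDs.
    weights:  dict mapping the same names to a vote weight.
    """
    scores = defaultdict(float)
    for name, ranked in rankings.items():
        weight = weights[name]
        for rank, candidate in enumerate(ranked[:top_k]):
            # Assumed scoring rule: rank 0 earns top_k points, the last slot earns 1.
            scores[candidate] += weight * (top_k - rank)
    # Highest score first; ties are broken by candidate ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_k]


# Toy example with three hypothetical re-rankers and made-up weights.
rankings = {
    "reranker_a": ["c1", "c2", "c3"],
    "reranker_b": ["c2", "c1", "c4"],
    "reranker_c": ["c2", "c3", "c1"],
}
weights = {"reranker_a": 0.5, "reranker_b": 0.3, "reranker_c": 0.2}
fused = weighted_vote(rankings, weights, top_k=3)
print(fused)
```

In this toy run, the candidate favoured by two of the three re-rankers overtakes the top pick of the single highest-weighted one, which is the behaviour weighted voting is meant to provide.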
Submitted 12 June, 2025;
originally announced June 2025.
-
Analyzing PDFs like Binaries: Adversarially Robust PDF Malware Analysis via Intermediate Representation and Language Model
Authors:
Side Liu,
Jiang Ming,
Guodong Zhou,
Xinyi Liu,
Jianming Fu,
Guojun Peng
Abstract:
Malicious PDF files have emerged as a persistent threat and become a popular attack vector in web-based attacks. While machine learning-based PDF malware classifiers have shown promise, these classifiers are often susceptible to adversarial attacks, undermining their reliability. To address this issue, recent studies have aimed to enhance the robustness of PDF classifiers. Despite these efforts, the feature engineering underlying these studies remains outdated. Consequently, even with the application of cutting-edge machine learning techniques, these approaches fail to fundamentally resolve the issue of feature instability.
To tackle this, we propose a novel approach for PDF feature extraction and PDF malware detection. We introduce the PDFObj IR (PDF Object Intermediate Representation), an assembly-like language framework for PDF objects, from which we extract semantic features using a pretrained language model. Additionally, we construct an Object Reference Graph to capture structural features, drawing inspiration from program analysis. This dual approach enables us to analyze and detect PDF malware based on both semantic and structural features. Experimental results demonstrate that, compared to state-of-the-art PDF malware classifiers, our proposed classifier achieves strong adversarial robustness while maintaining an exceptionally low false positive rate of only 0.07% on the baseline dataset.
Submitted 20 June, 2025;
originally announced June 2025.
-
Large Language Model Unlearning for Source Code
Authors:
Xue Jiang,
Yihong Dong,
Zheng Fang,
Yingwei Ma,
Tangxinyu Wang,
Rongyu Cao,
Binhua Li,
Zhi Jin,
Wenpin Jiao,
Yongbin Li,
Ge Li
Abstract:
LLM4SE has demonstrated significant success, but LLMs' potential memorization of sensitive or outdated training data introduces critical risks to legal compliance, software security, and code quality. LLM unlearning techniques, which can eliminate the influence of undesired data from LLMs in a post-training way, present a promising solution to address these concerns. While recent efforts in LLM unlearning show effectiveness in natural language, their applicability to source code remains underexplored. Our empirical study reveals that existing LLM unlearning approaches, when applied to source code, cause severe model utility degradation, rendering models practically unusable for code generation. In this paper, we propose PROD, a novel unlearning approach that enables LLMs to forget undesired code content while effectively preserving their code generation capabilities. PROD suppresses the probability of forget data in LLMs' output distribution while promoting candidate distributional components, enabling the model to jointly learn to forget specific content and retain its general capabilities. To facilitate this study, we establish a benchmark for code unlearning evaluation, which includes three critical downstream tasks: copyrighted code unlearning, insecure code unlearning, and deprecated API unlearning. Our evaluation demonstrates that PROD achieves a superior balance between forget quality and model utility compared to existing unlearning approaches across the three downstream tasks, while consistently exhibiting improvements when applied to LLMs of varying series. PROD also exhibits superior robustness against adversarial attacks without generating or exposing the data to be forgotten. The results underscore that our approach not only extends the application boundary of unlearning techniques to source code, but also holds significant implications for advancing reliable code generation.
Submitted 20 June, 2025;
originally announced June 2025.
-
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
Authors:
Haoran Sun,
Yankai Jiang,
Wenjie Lou,
Yujie Zhang,
Wenjie Li,
Lilong Wang,
Mianxin Liu,
Lei Liu,
Xiaosong Wang
Abstract:
Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches lack a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a list of medical visual question answering and reasoning benchmarks. Code is available on GitHub at manglu097/Chiron-o1.
Submitted 20 June, 2025;
originally announced June 2025.
-
RS-Coded Adaptive Dynamic Network for Reliable Long-Term Information Transmission in Disturbed Multimode Fiber
Authors:
Yang Hu,
Minyu Fan,
Kun Liu,
Songsong Zhu,
Nan Jiang,
Sha Wang
Abstract:
Multimode fiber (MMF), due to its large core diameter and high mode capacity, holds potential in high-speed communications. However, inherent modal dispersion causes output speckle distortion, and transmission characteristics are sensitive to environmental disturbances, limiting its reliable application. Conventional transmission matrix (TM) methods face challenges such as complex calibration and environmental sensitivity. Although current deep learning approaches demonstrate reconstruction potential, they struggle to overcome error accumulation caused by fiber mode drift and lack sufficient environmental adaptability. To address this, the present study proposes an adaptive transmission framework named Residual Reed-Solomon Dynamic Network (RRSDN), which integrates Reed-Solomon (RS) error correction coding with deep residual learning, forming a closed-loop system that jointly optimizes encoding, transmission, and reconstruction to tackle the key challenges of mode instability and error accumulation in dynamic scattering channels. Experimentally, high-fidelity real-time transmission of a 16×16-pixel video stream (H.265 compressed) with zero frame loss and 100% symbol accuracy was achieved over a 100-meter MMF with manually applied disturbances and no temperature control. This work proposes a solution for stable optical transmission in complex channels. It also integrates error correction coding with neural network training, laying the foundation for adaptive optical systems in longer-distance and more complex scenarios.
Submitted 23 June, 2025; v1 submitted 20 June, 2025;
originally announced June 2025.
-
Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection
Authors:
Yuchu Jiang,
Jiaming Chu,
Jian Zhao,
Xin Zhang,
Xu Yang,
Lei Jin,
Chi Zhang,
Xuelong Li
Abstract:
The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against distribution shifts in the test set, Loupe introduces a pseudo-label-guided test-time adaptation mechanism that leverages patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at https://github.com/Kamichanw/Loupe.
Submitted 20 June, 2025;
originally announced June 2025.
-
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
Authors:
Lei Jiang,
Zixun Zhang,
Zizhou Wang,
Xiaobing Sun,
Zhen Li,
Liangli Zhen,
Xiaohua Xu
Abstract:
Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs' cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO's effectiveness, showcasing robust performance and strong cross-model transferability. These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.
Submitted 20 June, 2025;
originally announced June 2025.
-
Giant Magneto-Optical Effects in Two-Dimensional Flat-Band Antiferromagnets
Authors:
Ping Yang,
Wanxiang Feng,
Siyuan Liu,
Shan Guan,
Liwei Wen,
Wei Jiang,
Gui-Bin Liu,
Yugui Yao
Abstract:
In this work, we reveal giant magneto-optical responses in two-dimensional (2D) antiferromagnets with nearly flat electronic bands, based on first-principles calculations and group-theoretical analysis. We identify a record-large second-order magneto-optical Schäfer-Hubert (SH) effect, featuring a polarization rotation angle of 28 degrees, in monolayer antiferromagnetic RuOCl2, driven by flat-band-enhanced interband optical transitions. Both the valence and conduction bands exhibit pronounced directional flatness, giving rise to highly anisotropic optical absorption and broadband hyperbolic frequency windows spanning the entire visible spectrum. This anisotropy leads to an exceptionally strong linear dichroism (LD) reaching 50%, far exceeding values reported in other 2D magnetic systems. Remarkably, the giant SH effect and LD appear at distinct photon energies, reflecting a momentum-direction-dependent crossover between flat and dispersive bands. Both responses are further amplified with increasing RuOCl2 film thickness. Our results establish flat-band antiferromagnets as a fertile platform for realizing giant nonlinear magneto-optical effects and open new avenues for 2D opto-spintronic device applications.
Submitted 20 June, 2025;
originally announced June 2025.
-
PPTP: Performance-Guided Physiological Signal-Based Trust Prediction in Human-Robot Collaboration
Authors:
Hao Guo,
Wei Fan,
Shaohui Liu,
Feng Jiang,
Chunzhi Yi
Abstract:
Trust prediction is a key issue in human-robot collaboration, especially in construction scenarios where maintaining appropriate trust calibration is critical for safety and efficiency. This paper introduces the Performance-guided Physiological signal-based Trust Prediction (PPTP), a novel framework designed to improve trust assessment. We designed a human-robot construction scenario with three difficulty levels to induce different trust states. Our approach integrates synchronized multimodal physiological signals (ECG, GSR, and EMG) with collaboration performance evaluation to predict human trust levels. Individual physiological signals are processed using collaboration performance information as guiding cues, leveraging the standardized nature of collaboration performance to compensate for individual variations in physiological responses. Extensive experiments demonstrate the efficacy of our cross-modality fusion method in significantly improving trust classification performance. Our model achieves over 81% accuracy in three-level trust classification, outperforming the best baseline method by 6.7%, and notably reaches 74.3% accuracy in high-resolution seven-level classification, which is a first in trust prediction research. Ablation experiments further validate the superiority of physiological signal processing guided by collaboration performance assessment.
Submitted 19 June, 2025;
originally announced June 2025.
-
Observation of a spin-textured nematic Kondo lattice
Authors:
Yu-Xiao Jiang,
Zi-Jia Cheng,
Qiaozhi Xu,
Md Shafayat Hossain,
Xian P. Yang,
Jia-Xin Yin,
Maksim Litskevich,
Tyler A. Cochran,
Byunghoon Kim,
Eduardo Miranda,
Sheng Ran,
Rafael M. Fernandes,
M. Zahid Hasan
Abstract:
The Kondo lattice model, one of the most fundamental models in condensed matter physics, has been employed to describe a wide range of quantum materials such as heavy fermions, transition metal dichalcogenides, and two-dimensional moiré systems. Discovering new phases on the Kondo lattice and unveiling their mechanisms are crucial to the understanding of strongly correlated systems. Here, in a layered Kondo magnet USbTe, we observe a spin-textured nematic state and visualize a heavy electronic liquid-crystal phase. Employing scanning tunneling microscopy and spectroscopy (STM/STS), we visualize a tetragonal symmetry breaking of heavy electronic states around the Fermi level. By systematically investigating the temperature and energy dependence of spectroscopic data, we find that the nematic state coincides with the formation of heavy quasi-particles driven by band hybridization. Remarkably, using spin-polarized STM, we demonstrate that the nematic state is spin polarized, which not only suggests its intrinsically electronic nature, but also represents the unique magnetic texture of nematic heavy fermions. Our findings unveil a novel correlation-mediated order whose mechanism is inherently tied to Kondo-lattice physics. The observation of heavy nematic states enriches the phase diagram of correlated systems and provides a rare platform to explore the interplay of Kondo physics, spontaneous symmetry breaking, and quantum criticality.
Submitted 19 June, 2025;
originally announced June 2025.
-
Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details
Authors:
Zeqiang Lai,
Yunfei Zhao,
Haolin Liu,
Zibo Zhao,
Qingxiang Lin,
Huiwen Shi,
Xianghui Yang,
Mingxin Yang,
Shuhui Yang,
Yifei Feng,
Sheng Zhang,
Xin Huang,
Di Luo,
Fan Yang,
Fang Yang,
Lifu Wang,
Sicong Liu,
Yixuan Tang,
Yulin Cai,
Zebin He,
Tian Liu,
Yuhong Liu,
Jie Jiang,
Linus,
Jingwei Huang
, et al. (1 additional authors not shown)
Abstract:
In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version, Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model -- LATTICE, which is trained with scaled high-quality datasets, model size, and compute. Our largest model reaches 10B parameters and generates sharp and detailed 3D shapes with precise image-3D following while keeping the mesh surface clean and smooth, significantly closing the gap between generated and handcrafted 3D shapes. In terms of texture generation, it is upgraded with physically based rendering (PBR) via a novel multi-view architecture extended from the Hunyuan3D 2.0 Paint model. Our extensive evaluation shows that Hunyuan3D 2.5 significantly outperforms previous methods in both shape and end-to-end texture generation.
Submitted 19 June, 2025;
originally announced June 2025.
-
SO emission in the dynamically perturbed protoplanetary disks around CQ Tau and MWC 758
Authors:
Francesco Zagaria,
Haochang Jiang,
Gianni Cataldi,
Stefano Facchini,
Myriam Benisty,
Yuri Aikawa,
Sean Andrews,
Jaehan Bae,
Marcelo Barraza-Alfaro,
Pietro Curone,
Ian Czekala,
Daniele Fasano,
Cassandra Hall,
Iain Hammond,
Jane Huang,
John D. Ilee,
Andrés F. Izquierdo,
Jensen Lawrence,
Giuseppe Lodato,
François Ménard,
Christophe Pinte,
Giovanni P. Rosotti,
Jochen Stadler,
Richard Teague,
Leonardo Testi
, et al. (3 additional authors not shown)
Abstract:
We report the serendipitous detection of the SO $J_N=6_5-5_4$ (219.949 GHz) rotational transition in archival Atacama Large Millimeter/submillimeter Array (ALMA) observations of the spiral hosting protoplanetary disks around CQ Tau (with $\approx4.9σ$ significance) and MWC 758 (with $\approx3.4σ$ significance). In the former, the SO emission comes in the shape of a ring, arises from the edge of the continuum cavity, and is qualitatively consistent, at the currently available spectral resolution, with being in Keplerian rotation. In the latter, instead, while arising primarily from inside the continuum cavity, the SO emission also extends to the continuum ring(s), and its morphology and kinematics are less clear. We put these sources in the context of the other protoplanetary disks where SO detections have been previously reported in the literature and discuss the possible origins of SO in terms of (thermal) desorption or formation in the gas phase. We argue that these processes might be fostered by dynamical perturbations caused by unseen embedded massive companions, shadows, or late-time infall, thus suggesting a possible link between perturbed dynamics and SO emission in (these) protoplanetary disks. If confirmed, our interpretation would imply that chemical evolution timescales could be significantly shorter in these systems than is commonly assumed, indicating that dynamical perturbations might influence the composition of newborn (proto-)planets by altering the volatile makeup of their formation environment.
Submitted 19 June, 2025;
originally announced June 2025.
-
Collision-assisted information scrambling on a configurable photonic chip
Authors:
Xiao-Wen Shang,
Shu-Yi Liang,
Guan-Ju Yan,
Xin-Yang Jiang,
Zi-Ming Yin,
Hao Tang,
Jian-Peng Dou,
Ze-Kun Jiang,
Yu-Quan Peng,
Xian-Min Jin
Abstract:
Quantum interference and entanglement are at the core of quantum computation. The fast spread of information in a quantum circuit helps to mitigate the circuit depth. Although information scrambling in closed systems has been proposed and tested in digital circuits, how to measure the evolution of quantum correlations between systems and environments remains a delicate and open question. Here, we propose a photonic circuit to investigate information scrambling in an open quantum system by implementing the collision model with cascaded Mach-Zehnder interferometers. We numerically simulate the photon propagation and find that the tripartite mutual information strongly depends on the system-environment and environment-environment interactions. We further reduce the number of observables and the number of shots required to reconstruct the density matrix by designing an enhanced compressed-sensing scheme. Our results provide a reconfigurable photonic platform for simulating open quantum systems and pave the way for exploring controllable dissipation and non-Markovianity in discrete-variable photonic computing.
Submitted 19 June, 2025;
originally announced June 2025.
-
Investigating Transit Timing Variations in the Ultra-short Period Exoplanet WASP-19b
Authors:
Shraddha Biswas,
Ing-Guey Jiang,
Li-Chin Yeh,
Hsin-Min Liu,
Kaviya Parthasarathy,
Devesh P. Sariya,
D. Bisht,
Mohit Singh Bisht,
A. Raj
Abstract:
In this study, we present a comprehensive analysis of transit timing variations (TTVs) in the ultra-short-period gas giant WASP-19b, which orbits a G-type main-sequence star. Our analysis is based on a dataset comprising 204 transit light curves obtained from the Transiting Exoplanet Survey Satellite (TESS), the Exoplanet Transit Database (ETD), and the ExoClock project, supplemented by 18 publicly available light curves. Mid-transit times were extracted from these data, and an additional 98 mid-transit times compiled from the literature were incorporated, resulting in a combined dataset spanning approximately 14 years. After excluding light curves significantly impacted by stellar activity, such as starspot anomalies, the final dataset consisted of 252 high-quality mid-transit times. Initial inspection of the transit timing residuals using an apsidal precession model suggested the possible presence of an additional planetary companion. However, subsequent frequency analysis and sinusoidal model fitting indicate that the observed TTVs are more consistently explained by apsidal precession of WASP-19b's orbit. We also considered alternative mechanisms, including the Applegate mechanism and the Shklovskii effect. Our findings suggest that stellar magnetic activity, potentially linked to the Applegate mechanism, may also contribute to the observed timing variations. To further constrain the origin of the TTVs and assess the contributions of these mechanisms, continued high-precision photometric monitoring of the WASP-19 system is strongly recommended.
Submitted 19 June, 2025;
originally announced June 2025.
-
A Tractable Approach to Massive Communication and Ubiquitous Connectivity in 6G Standardization
Authors:
Junyi Jiang,
Wei Chen,
Xin Guo,
Shenghui Song,
Ying Jun Zhang,
Zhu Han,
Merouane Debbah,
Khaled B. Letaief
Abstract:
Full-scale 6G standardization has attracted considerable recent attention, especially since the first 3GPP-wide 6G workshop held in March 2025. To understand the practical and fundamental values of 6G and facilitate its standardization, it is crucial to explore the theoretical limits of spectrum, energy, and coverage efficiency considering practical hardware and signaling constraints. In this paper, we present a mean-field-approximation-based investigation of two of the six usage scenarios defined by IMT-2030, namely, massive communication and ubiquitous connectivity. Being aware of the limitation in interference cancellation owing to constrained cost and hardware complexity, we investigate the spectrum reuse architecture in both usage scenarios. We propose a tractable spectrum reuse scheme with low signaling overhead for channel estimation and channel state information (CSI) feedback. Our analysis indicates that massive communication over cellular and device-to-device (D2D) networks can benefit from channel orthogonalization, while it is unnecessary to share the CSI of interfering links. Moreover, deploying relays or movable base stations, e.g., unmanned aerial vehicles, yields substantial energy and spectrum gains for ubiquitous connectivity, despite introducing interference. As such, the mean-field-optimization-based evaluation is expected to positively impact 6G and NextG standardization in 3GPP and other standardization bodies.
Submitted 19 June, 2025;
originally announced June 2025.
-
Fine-grained Image Retrieval via Dual-Vision Adaptation
Authors:
Xin Jiang,
Meiqi Cao,
Hao Tang,
Fei Shen,
Zechao Li
Abstract:
Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, these two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adjusted features closer to the task solved during pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.
Submitted 19 June, 2025;
originally announced June 2025.
-
CapsDT: Diffusion-Transformer for Capsule Robot Manipulation
Authors:
Xiting He,
Mingwu Su,
Xinqi Jiang,
Long Bai,
Jiewen Lai,
Hongliang Ren
Abstract:
Vision-Language-Action (VLA) models have emerged as a prominent research area, showcasing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly endoscopy capsule robots that perform actions within the digestive system, remains unexplored. The integration of VLA models into endoscopy robots allows more intuitive and efficient interactions between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. In this work, we design CapsDT, a Diffusion Transformer model for capsule robot manipulation in the stomach. By processing interleaved visual inputs, and textual instructions, CapsDT can infer corresponding robotic control signals to facilitate endoscopy tasks. In addition, we developed a capsule endoscopy robot system, a capsule robot controlled by a robotic arm-held magnet, addressing different levels of four endoscopy tasks and creating corresponding capsule robot datasets within the stomach simulator. Comprehensive evaluations on various robotic tasks indicate that CapsDT can serve as a robust vision-language generalist, achieving state-of-the-art performance in various levels of endoscopy tasks while achieving a 26.25% success rate in real-world simulation manipulation.
Submitted 19 June, 2025;
originally announced June 2025.
-
Development of a persuasive User Experience Research (UXR) Point of View for Explainable Artificial Intelligence (XAI)
Authors:
Mohammad Naiseh,
Huseyin Dogan,
Stephen Giff,
Nan Jiang
Abstract:
Explainable Artificial Intelligence (XAI) plays a critical role in fostering user trust and understanding in AI-driven systems. However, the design of effective XAI interfaces presents significant challenges, particularly for UX professionals who may lack technical expertise in AI or machine learning. Existing explanation methods, such as SHAP, LIME, and counterfactual explanations, often rely on complex technical language and assumptions that are difficult for non-expert users to interpret. To address these gaps, we propose a UX Research (UXR) Playbook for XAI - a practical framework aimed at supporting UX professionals in designing accessible, transparent, and trustworthy AI experiences. Our playbook offers actionable guidance to help bridge the gap between technical explainability methods and user centred design, empowering designers to create AI interactions that foster better understanding, trust, and responsible AI adoption.
Submitted 19 June, 2025;
originally announced June 2025.
-
Single-Microphone-Based Sound Source Localization for Mobile Robots in Reverberant Environments
Authors:
Jiang Wang,
Runwu Shi,
Benjamin Yen,
He Kong,
Kazuhiro Nakadai
Abstract:
Accurately estimating sound source positions is crucial for robot audition. However, existing sound source localization methods typically rely on a microphone array with at least two spatially preconfigured microphones. This requirement hinders the applicability of microphone-based robot audition systems and technologies. To alleviate these challenges, we propose an online sound source localization method that uses a single microphone mounted on a mobile robot in reverberant environments. Specifically, we develop a lightweight neural network model with only 43k parameters to perform real-time distance estimation by extracting temporal information from reverberant signals. The estimated distances are then processed using an extended Kalman filter to achieve online sound source localization. To the best of our knowledge, this is the first work to achieve online sound source localization using a single microphone on a moving robot, a gap that we aim to fill in this work. Extensive experiments demonstrate the effectiveness and merits of our approach. To benefit the broader research community, we have open-sourced our code at https://github.com/JiangWAV/single-mic-SSL.
Submitted 19 June, 2025;
originally announced June 2025.
-
Regression Testing Optimization for ROS-based Autonomous Systems: A Comprehensive Review of Techniques
Authors:
Yupeng Jiang,
Shuaiyi Sun,
Xi Zheng
Abstract:
Regression testing plays a critical role in maintaining software reliability, particularly for ROS-based autonomous systems (ROSAS), which frequently undergo continuous integration and iterative development. However, conventional regression testing techniques face significant challenges when applied to autonomous systems due to their dynamic and non-deterministic behaviors, complex multi-modal sensor data, asynchronous distributed architectures, and stringent safety and real-time constraints. Although numerous studies have explored test optimization in traditional software contexts, regression testing optimization specifically for ROSAS remains largely unexplored. To address this gap, we present the first comprehensive survey systematically reviewing regression testing optimization techniques tailored for ROSAS. We analyze and categorize 122 representative studies into regression test case prioritization, minimization, and selection methods. A structured taxonomy is introduced to clearly illustrate their applicability and limitations within ROSAS contexts. Furthermore, we highlight major challenges specific to regression testing for ROSAS, including effectively prioritizing tests in response to frequent system modifications, efficiently minimizing redundant tests, and difficulty in accurately selecting impacted test cases. Finally, we propose research insights and identify promising future directions, such as leveraging frame-to-vector coverage metrics, multi-source foundation models, and neurosymbolic reasoning to enhance regression testing efficiency and effectiveness. This survey provides a foundational reference and practical roadmap for advancing the state-of-the-art in regression testing optimization for ROSAS.
Submitted 19 June, 2025;
originally announced June 2025.
-
DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning
Authors:
Boyu Li,
Siyuan He,
Hang Xu,
Haoqi Yuan,
Yu Zang,
Liwei Hu,
Junpeng Yue,
Zhenxiong Jiang,
Pengbo Hu,
Börje F. Karlsson,
Yehui Tang,
Zongqing Lu
Abstract:
Developing embodied agents capable of performing complex interactive tasks in real-world scenarios remains a fundamental challenge in embodied AI. Although recent advances in simulation platforms have greatly enhanced task diversity to train embodied Vision Language Models (VLMs), most platforms rely on simplified robot morphologies and bypass the stochastic nature of low-level execution, which limits their transferability to real-world robots. To address these issues, we present a physics-based simulation platform DualTHOR for complex dual-arm humanoid robots, built upon an extended version of AI2-THOR. Our simulator includes real-world robot assets, a task suite for dual-arm collaboration, and inverse kinematics solvers for humanoid robots. We also introduce a contingency mechanism that incorporates potential failures through physics-based low-level execution, bridging the gap to real-world scenarios. Our simulator enables a more comprehensive evaluation of the robustness and generalization of VLMs in household environments. Extensive evaluations reveal that current VLMs struggle with dual-arm coordination and exhibit limited robustness in realistic environments with contingencies, highlighting the importance of using our simulator to develop more capable VLMs for embodied tasks. The code is available at https://github.com/ds199895/DualTHOR.git.
Submitted 19 June, 2025;
originally announced June 2025.
-
Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization
Authors:
Cong Wang,
Zexuan Deng,
Zhiwei Jiang,
Fei Shen,
Yafeng Yin,
Shiwei Gan,
Zifeng Cheng,
Shiping Ge,
Qing Gu
Abstract:
Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
Submitted 18 June, 2025;
originally announced June 2025.
-
TrajDiff: Diffusion Bridge Network with Semantic Alignment for Trajectory Similarity Computation
Authors:
Xiao Zhang,
Xingyu Zhao,
Hong Xia,
Yuan Cao,
Guiyuan Jiang,
Junyu Dong,
Yanwei Yu
Abstract:
With the proliferation of location-tracking technologies, massive volumes of trajectory data are continuously being collected. As a fundamental task in trajectory data mining, trajectory similarity computation plays a critical role in a wide range of real-world applications. However, existing learning-based methods face three challenges: First, they ignore the semantic gap between GPS and grid features in trajectories, making it difficult to obtain meaningful trajectory embeddings. Second, the noise inherent in the trajectories, as well as the noise introduced during grid discretization, obscures the true motion patterns of the trajectories. Third, existing methods focus solely on point-wise and pair-wise losses, without utilizing the global ranking information obtained by sorting all trajectories according to their similarity to a given trajectory. To address the aforementioned challenges, we propose a novel trajectory similarity computation framework, named TrajDiff. Specifically, the semantic alignment module relies on cross-attention and an attention score mask mechanism with adaptive fusion, effectively eliminating semantic discrepancies between data at two scales and generating a unified representation. Additionally, the DDBM-based Noise-robust Pre-Training introduces the transfer patterns between any two trajectories into the model training process, enhancing the model's noise robustness. Finally, the overall ranking-aware regularization shifts the model's focus from a local to a global perspective, enabling it to capture the holistic ordering information among trajectories. Extensive experiments on three publicly available datasets show that TrajDiff consistently outperforms state-of-the-art baselines. In particular, it achieves an average HR@1 gain of 33.38% across all three evaluation metrics and datasets.
Submitted 18 June, 2025;
originally announced June 2025.
-
OAgents: An Empirical Study of Building Effective Agents
Authors:
He Zhu,
Tianrui Qin,
King Zhu,
Heyuan Huang,
Yeyi Guan,
Jinxiang Xia,
Yi Yao,
Hanhao Li,
Ningning Wang,
Pai Liu,
Tianhao Peng,
Xin Gui,
Xiaowan Li,
Yuhui Liu,
Yuchen Eleanor Jiang,
Jun Wang,
Changwang Zhang,
Xiangru Tang,
Ge Zhang,
Jian Yang,
Minghao Liu,
Xitong Gao,
Jiaheng Liu,
Wangchunshu Zhou
Abstract:
Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on the GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents and which are redundant despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.
Submitted 23 June, 2025; v1 submitted 17 June, 2025;
originally announced June 2025.
-
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
Authors:
Kunxi Li,
Zhonghua Jiang,
Zhouzhou Shen,
Zhaode Wang,
Chengfei Lv,
Shengyu Zhang,
Fan Wu,
Fei Wu
Abstract:
This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference. In multimodal scenarios, attention heads exhibit varying preferences for different modalities, resulting in significant disparities in modality importance across attention heads. Traditional KV cache eviction methods, which are tailored for unimodal settings, fail to capture modality-specific information, thereby yielding suboptimal performance. MadaKV addresses these challenges through two key components: modality preference adaptation and hierarchical compression compensation. By dynamically sensing modality information within attention heads and adaptively retaining critical tokens, MadaKV achieves substantial reductions in KV cache memory footprint and model inference decoding latency (1.3 to 1.5 times improvement) while maintaining high accuracy across various multimodal long-context tasks. Extensive experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV compared to existing KV cache eviction methods.
Submitted 5 June, 2025;
originally announced June 2025.
-
BatteryBERT for Realistic Battery Fault Detection Using Point-Masked Signal Modeling
Authors:
Songqi Zhou,
Ruixue Liu,
Yixing Wang,
Jia Lu,
Benben Jiang
Abstract:
Accurate fault detection in lithium-ion batteries is essential for the safe and reliable operation of electric vehicles and energy storage systems. However, existing methods often struggle to capture complex temporal dependencies and cannot fully leverage abundant unlabeled data. Although large language models (LLMs) exhibit strong representation capabilities, their architectures are not directly suited to the numerical time-series data common in industrial settings. To address these challenges, we propose a novel framework that adapts BERT-style pretraining for battery fault detection by extending the standard BERT architecture with a customized time-series-to-token representation module and a point-level Masked Signal Modeling (point-MSM) pretraining task tailored to battery applications. This approach enables self-supervised learning on sequential current, voltage, and other charge-discharge cycle data, yielding distributionally robust, context-aware temporal embeddings. We then concatenate these embeddings with battery metadata and feed them into a downstream classifier for accurate fault classification. Experimental results on a large-scale real-world dataset show that models initialized with our pretrained parameters significantly improve both representation quality and classification accuracy, achieving an AUROC of 0.945 and substantially outperforming existing approaches. These findings validate the effectiveness of BERT-style pretraining for time-series fault detection.
Submitted 31 May, 2025;
originally announced June 2025.