NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving

Authors: Qucheng Peng, Chen Bai, Guoxiang Zhang, Bo Xu, Xiaotong Liu, Xiaoyin Zheng, Chen Chen, Cheng Lu

Abstract: Autonomous driving systems have made significant advances in Q&A, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. Moreover, we develop three complementary paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses by establishing preferences for navigation-relevant summarized information; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion. Extensive experiments demonstrate that our approaches significantly improve performance across perception, prediction, planning, and question-answering tasks by enabling reasoning capabilities beyond visual range and improving generalization to diverse driving scenarios. This work represents a significant step toward more comprehensive autonomous driving systems capable of navigating complex, unfamiliar environments with greater reliability and safety.

Submitted 7 July, 2025; originally announced July 2025.

Comments: Accepted by ACM Multimedia 2025

arXiv:2507.05113 [pdf, ps, other]

CLIP-Guided Backdoor Defense through Entropy-Based Poisoned Dataset Separation

Authors: Binyan Xu, Fan Yang, Xilin Dai, Di Tang, Kehuan Zhang

Abstract: Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where adversaries poison training data to implant a backdoor into the victim model. Current backdoor defenses on poisoned data often suffer from high computational costs or low effectiveness against advanced attacks like clean-label and clean-image backdoors. To address these issues, we introduce CLIP-Guided backdoor Defense (CGD), an efficient and effective method that mitigates various backdoor attacks. CGD utilizes a publicly accessible CLIP model to identify inputs that are likely to be clean or poisoned. It then retrains the model with these inputs, using CLIP's logits as guidance to effectively neutralize the backdoor. Experiments on 4 datasets and 11 attack types demonstrate that CGD reduces attack success rates (ASRs) to below 1% while maintaining clean accuracy (CA) with a maximum drop of only 0.3%, outperforming existing defenses. Additionally, we show that clean-data-based defenses can be adapted to poisoned data using CGD. Also, CGD exhibits strong robustness, maintaining low ASRs even when employing a weaker CLIP model or when CLIP itself is compromised by a backdoor. These findings underscore CGD's exceptional efficiency, effectiveness, and applicability for real-world backdoor defense scenarios. Code: https://github.com/binyxu/CGD.

Submitted 7 July, 2025; originally announced July 2025.

Comments: 15 pages, 9 figures, 15 tables. To appear in the Proceedings of the 32nd ACM International Conference on Multimedia (MM '25)

MSC Class: 68T07 ACM Class: I.2.6

arXiv:2507.04574 [pdf, ps, other]

Deciphering the interplay between wetting and chemo-mechanical fracture in lithium-ion battery cathode materials

Authors: Wan-Xin Chen, Luis J. Carrillo, Arnab Maji, Xiang-Long Peng, Joseph Handy, Sarbajit Banerjee, Bai-Xiang Xu

Abstract: Crack growth in lithium-ion battery electrodes is typically detrimental and undesirable. However, recent experiments suggest that stabilized fracture of cathode active materials in liquid electrolytes can increase electrochemically active surfaces, shorten diffusion pathways, enhance (de)lithiation and improve overall capacity. To decipher the fundamental couplings between electrolyte wetting and fracture evolution and evaluate their influences on macroscopic battery performance, we conducted an integrated experiment-simulation study on $α$-V2O5 single crystals and polycrystalline NCM as model cathode materials. Despite synthesis challenges, single-crystal $α$-V2O5 offers clearer fundamental insights than polycrystalline counterparts with grain-boundary complexities. Fracture patterns and lithiation heterogeneities on the samples were mapped using advanced scanning techniques after chemical (de)lithiation cycles, exhibiting excellent agreement with simulations by the developed multiphysics model. Results reveal a mutually reinforcing interplay between wetting and fracture: (i) electrolyte infiltration at fracture surfaces enhances (de)lithiation and compositional heterogeneity; (ii) wetting influences fracture dynamics, including fracture modes, propagation distance and directionality. The validated modelling framework is further applied to simulations on polycrystalline NCM particles under constant-current (dis)charging, highlighting the critical role of wetting in promoting fracture and improving overall capacity. This work bridges fundamental understanding of wetting-fracture coupling with practical implications for battery performance optimization via controlled fracture engineering.

Submitted 6 July, 2025; originally announced July 2025.

arXiv:2507.04487 [pdf, ps, other]

LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization

Authors: Xujia Wang, Yunjia Qi, Bin Xu

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA (Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about $27\%$ compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training.

Submitted 6 July, 2025; originally announced July 2025.

Comments: 18 pages, 12 figures

arXiv:2507.03917 [pdf, ps, other]

Consistency-Aware Padding for Incomplete Multi-Modal Alignment Clustering Based on Self-Repellent Greedy Anchor Search

Authors: Shubin Ma, Liang Zhao, Mingdong Lu, Yifan Guo, Bo Xu

Abstract: Multimodal representation is faithful and highly effective in describing real-world data samples' characteristics by describing their complementary information. However, the collected data often exhibits incomplete and misaligned characteristics due to factors such as inconsistent sensor frequencies and device malfunctions. Existing research has not effectively addressed the issue of filling missing data in scenarios where multiview data are both imbalanced and misaligned. Instead, it relies on class-level alignment of the available data. Thus, it results in some data samples not being well-matched, thereby affecting the quality of data fusion. In this paper, we propose the Consistency-Aware Padding for Incomplete Multimodal Alignment Clustering Based on Self-Repellent Greedy Anchor Search (CAPIMAC) to tackle the problem of filling imbalanced and misaligned data in multimodal datasets. Specifically, we propose a self-repellent greedy anchor search module (SRGASM), which employs a self-repellent random walk combined with a greedy algorithm to identify anchor points for re-representing incomplete and misaligned multimodal data. Subsequently, based on noise-contrastive learning, we design a consistency-aware padding module (CAPM) to effectively interpolate and align imbalanced and misaligned data, thereby improving the quality of multimodal data fusion. Experimental results demonstrate the superiority of our method on benchmark datasets. The code will be publicly released at https://github.com/Autism-mm/CAPIMAC.git.

Submitted 5 July, 2025; originally announced July 2025.

Comments: Accepted at IJCAI 2025. 9 pages, 3 figures

ACM Class: I.2.6; I.5.3

arXiv:2507.03133 [pdf, ps, other]

ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Authors: Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong

Abstract: Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When instructing LLMs to indicate unsolvability using a reliable prompt, the reliability of larger-sized LLMs remains on solvable problems, but notably improves on unsolvable problems yet still falls short of solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs' reliability, which can significantly improve LLM reliability performances on both in-domain and out-of-domain tasks.

Submitted 3 July, 2025; originally announced July 2025.

Comments: under review

arXiv:2507.03122 [pdf, ps, other]

Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings

Authors: Binbin Xu, Gérard Dray

Abstract: This study investigates the feasibility and performance of federated learning (FL) for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. Unlike previous approaches that rely on centralized training or fine-tuned large language models, we propose a lightweight and scalable pipeline combining frozen text embeddings with simple multilayer perceptron (MLP) classifiers. This design offers a privacy-preserving and deployment-efficient alternative for clinical NLP applications, particularly suited to distributed healthcare settings. Extensive experiments across both centralized and federated configurations were conducted, testing six publicly available embedding models from the Massive Text Embedding Benchmark leaderboard and three MLP classifier architectures under two medical coding systems (ICD-9 and ICD-10). Additionally, ablation studies over ten random stratified splits assess performance stability. Results show that embedding quality substantially outweighs classifier complexity in determining predictive performance, and that federated learning can closely match centralized results in idealized conditions. While the models are orders of magnitude smaller than state-of-the-art architectures and achieved competitive micro and macro F1 scores, limitations remain, including the lack of end-to-end training and the simplified FL assumptions. Nevertheless, this work demonstrates a viable way toward scalable, privacy-conscious medical coding systems and offers a step toward future research into federated, domain-adaptive clinical AI.

Submitted 3 July, 2025; originally announced July 2025.

Comments: 20 pages

arXiv:2507.02636 [pdf, ps, other]

Online Convex Optimization for Coordinated Long-Term and Short-Term Isolated Microgrid Dispatch

Authors: Ning Qi, Yousuf Baker, Bolun Xu

Abstract: This paper proposes a novel non-anticipatory long-short-term coordinated dispatch framework for isolated microgrids with hybrid short-long-duration energy storages (LDES). We introduce a convex hull approximation model for nonconvex LDES electrochemical dynamics, facilitating computational tractability and accuracy. To address temporal coupling in state-of-charge (SoC) dynamics and long-term contracts, we generate hindsight-optimal SoC trajectories of LDES and netloads for offline training. In the online stage, we employ kernel regression to dynamically update the SoC reference and propose an adaptive online convex optimization (OCO) algorithm with SoC reference tracking and expert tracking to mitigate myopia and enable adaptive step-size optimization. We rigorously prove that both long-term and short-term policies achieve sublinear regret bounds over time, which improve with more regression scenarios, stronger tracking penalties, and finer convex approximations. Simulation results show that the proposed method outperforms state-of-the-art methods, reducing costs by 73.4%, eliminating load loss via reference tracking, and achieving an additional 2.4% cost saving via the OCO algorithm. These benefits scale up with longer LDES durations, and the method demonstrates resilience to poor forecasts and unexpected system faults.

Submitted 4 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

arXiv:2507.01889 [pdf, ps, other]

STEM Diffraction Pattern Analysis with Deep Learning Networks

Authors: Sebastian Wissel, Jonas Scheunert, Aaron Dextre, Shamail Ahmed, Andreas Bayer, Kerstin Volz, Bai-Xiang Xu

Abstract: Accurate grain orientation mapping is essential for understanding and optimizing the performance of polycrystalline materials, particularly in energy-related applications. Lithium nickel oxide (LiNiO$_{2}$) is a promising cathode material for next-generation lithium-ion batteries, and its electrochemical behaviour is closely linked to microstructural features such as grain size and crystallographic orientations. Traditional orientation mapping methods--such as manual indexing, template matching (TM), or Hough transform-based techniques--are often slow and noise-sensitive when handling complex or overlapping patterns, creating a bottleneck in large-scale microstructural analysis. This work presents a machine learning-based approach for predicting Euler angles directly from scanning transmission electron microscopy (STEM) diffraction patterns (DPs). This enables the automated generation of high-resolution crystal orientation maps, facilitating the analysis of internal microstructures at the nanoscale. Three deep learning architectures--convolutional neural networks (CNNs), Dense Convolutional Networks (DenseNets), and Shifted Windows (Swin) Transformers--are evaluated, using an experimentally acquired dataset labelled via a commercial TM algorithm. While the CNN model serves as a baseline, both DenseNets and Swin Transformers demonstrate superior performance, with the Swin Transformer achieving the highest evaluation scores and the most consistent microstructural predictions. The resulting crystal maps exhibit clear grain boundary delineation and coherent intra-grain orientation distributions, underscoring the potential of attention-based architectures for analyzing diffraction-based image data. These findings highlight the promise of combining advanced machine learning models with STEM data for robust, high-throughput microstructural characterization.

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01299 [pdf, ps, other]

La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

Authors: Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, Lulu Hu

Abstract: Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.

Submitted 1 July, 2025; originally announced July 2025.

Comments: ICML 2025 Acceptance
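
The idea stated in the LaRoSA abstract above (rotate activations with an orthogonal matrix, then keep the Top-K entries) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the rotations are obtained or applied inside the model, so the 2x2 rotation `q`, the helper names, and the step that rotates back with the transpose are all assumptions chosen only to show why a rotation can make a vector easier to sparsify.

```python
import math


def rotate(x, q):
    """Return the matrix-vector product q @ x, with q given as a list of rows."""
    return [sum(q_ij * x_j for q_ij, x_j in zip(row, x)) for row in q]


def top_k_sparsify(x, k):
    """Keep the k largest-magnitude entries of x and set the others to zero."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def rotated_top_k(x, q, k):
    """Rotate x by an orthogonal q, apply Top-K, then map back with q^T.

    Mapping back with the transpose is an illustrative choice (hypothetical);
    it lets us compare the sparsified vector with the original input.
    """
    z = top_k_sparsify(rotate(x, q), k)
    q_t = [list(col) for col in zip(*q)]  # transpose = inverse for orthogonal q
    return z, rotate(z, q_t)


if __name__ == "__main__":
    c = s = 1.0 / math.sqrt(2.0)          # hand-picked 45-degree rotation (assumed)
    q = [[c, s], [-s, c]]
    x = [1.0, 1.0]
    print("plain Top-1:   ", top_k_sparsify(x, 1))
    z, x_hat = rotated_top_k(x, q, 1)
    print("rotated Top-1: ", z)
    print("reconstruction:", x_hat)
```

In this toy case, Top-1 selection on the raw vector discards half of its energy, whereas after the rotation all of the energy sits in one coordinate, so the same 50% sparsity loses essentially nothing. Because K is fixed per vector, the sparsity level is constant by construction, which is consistent with the "consistent model-level sparsity" the abstract describes.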

arXiv:2507.01006 [pdf, ps, other]

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Authors: GLM-V Team: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, et al. (54 additional authors not shown)

Abstract: We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.

Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

arXiv:2506.23972 [pdf, ps, other]

Visual and Memory Dual Adapter for Multi-Modal Object Tracking

Authors: Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu

Abstract: Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality by jointly modeling the frequency, spatial, and channel-wise features. Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.22776 [pdf, ps, other]

Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation

Authors: Sen Fang, Weiyuan Ding, Antonio Mastropaolo, Bowen Xu

Abstract: Quantization has emerged as a mainstream method for compressing Large Language Models (LLMs), reducing memory requirements and accelerating inference without architectural modifications. While existing research primarily focuses on evaluating the effectiveness of quantized LLMs compared to their original counterparts, the impact on robustness remains largely unexplored. In this paper, we present the first systematic investigation of how quantization affects the robustness of LLMs in code generation tasks. Through extensive experiments across four prominent LLM families (LLaMA, DeepSeek, CodeGen, and StarCoder) with parameter scales ranging from 350M to 33B, we evaluate robustness from dual perspectives: adversarial attacks on input prompts and noise perturbations on model architecture. Our findings challenge conventional wisdom by demonstrating that quantized LLMs often exhibit superior robustness compared to their full-precision counterparts, with 51.59% versus 42.86% of our adversarial experiments showing better resilience in quantized LLMs. Similarly, our noise perturbation experiments also confirm that LLMs after quantization generally withstand higher levels of weight disturbances. These results suggest that quantization not only reduces computational requirements but can actually enhance LLMs' reliability in code generation tasks, providing valuable insights for developing more robust and efficient LLM deployment strategies.

Submitted 28 June, 2025; originally announced June 2025.

Comments: 13 pages, 6 figures

arXiv:2506.21270 [pdf, ps, other]

Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

Authors: Cheng Zou, Senlin Cheng, Bolei Xu, Dandan Zheng, Xiaobo Li, Jingdong Chen, Ming Yang

Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all the frames. Naively applying image-based try-on methods frame by frame can give poor results due to severe inconsistency. Recent diffusion-based video try-on methods, though very few, converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it for the video try-on task, which has shown improvements, although inconsistency problems remain. In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods. In this way, we start with a video generation problem instead of an image-based try-on problem, which from the beginning has better spatial-temporal consistency. Specifically, we first build a video inpainting framework based on Diffusion Transformer with full 3D spatial-temporal attention, and then we progressively adapt it for video garment inpainting, with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to make sure the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.

Submitted 26 June, 2025; originally announced June 2025.

Comments: 10 pages, 6 figures

arXiv:2506.20444 [pdf, ps, other]

Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds

Authors: Xiang Lan, Tim Menzies, Bowen Xu

Abstract: Vulnerability detection is crucial for identifying security weaknesses in software systems. However, the effectiveness of machine learning models in this domain is often hindered by low-quality training datasets, which contain noisy, mislabeled, or imbalanced samples. This paper proposes a novel dataset maps-empowered approach that systematically identifies and mitigates hard-to-learn outliers, referred to as "bad seeds", to improve model training efficiency. Our approach can categorize training examples based on learning difficulty and integrate this information into an active learning framework. Unlike traditional methods that focus on uncertainty-based sampling, our strategy prioritizes dataset quality by filtering out performance-harmful samples while emphasizing informative ones. Our experimental results show that our approach can improve F1 score over random selection by 45.36% (DeepGini) and 45.91% (K-Means) and outperforms standard active learning by 61.46% (DeepGini) and 32.65% (K-Means) for CodeBERT on the Big-Vul dataset, demonstrating the effectiveness of integrating dataset maps for optimizing sample selection in vulnerability detection. Furthermore, our approach also enhances model robustness, improves sample selection by filtering bad seeds, and stabilizes active learning performance across iterations. By analyzing the characteristics of these outliers, we provide insights for future improvements in dataset construction, making vulnerability detection more reliable and cost-effective.

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.19580 [pdf, ps, other]

The optimal binding function for (cap, even hole)-free graphs

Authors: Ran Chen, Baogang Xu, Yian Xu

Abstract: A {\em hole} is an induced cycle of length at least 4, an {\em even hole} is a hole of even length, and a {\em cap} is a graph obtained from a hole by adding an additional vertex which is adjacent exactly to two adjacent vertices of the hole. A graph $G$ obtained from a graph $H$ by blowing up all the vertices into cliques is said to be a clique blowup of $H$. Let $p, q$ be two positive integers with $p>2q$, let $F$ be a triangle-free graph, and let $G'$ be a clique blowup of $F$ with $ω(G')\leq\max\{\frac{2q(p-q-2)}{p-2q}, 2q\}$. In this paper, we prove that for any clique blowup $G$ of $F$, $χ(G)\leq\lceil\frac{p}{2q}ω(G)\rceil$ if and only if $χ(G')\leq\lceil\frac{p}{2q}ω(G')\rceil$. As a consequence, we show that every (cap, even hole)-free graph $G$ satisfies $χ(G)\leq\lceil\frac{5}{4}ω(G)\rceil$, which affirmatively answers a question of Cameron {\em et al.} \cite{CdHV2018}. We also show that every (cap, even hole, 5-hole)-free graph $G$ satisfies $χ(G)\leq\lceil\frac{7}{6}ω(G)\rceil$, and the bound is reachable.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.18416 [pdf]

Determining the grain orientations of battery materials from electron diffraction patterns using convolutional neural networks

Authors: Jonas Scheunert, Shamail Ahmed, Thomas Demuth, Andreas Beyer, Sebastian Wissel, Bai-Xiang Xu, Kerstin Volz

Abstract: Polycrystalline materials have numerous applications due to their unique properties, which are often determined by the grain boundaries. Hence, quantitative characterization of grain as well as interface orientation is essential to optimize these materials, particularly energy materials. Using scanning transmission electron microscopy, matter can be analysed in an extremely fine grid of scan points via electron diffraction patterns at each scan point. By matching the diffraction patterns to a simulated database, the crystal orientation of the material as well as the orientation of the grain boundaries at each scan point can be determined. This pattern matching approach is highly time intensive. Artificial intelligence promises to be a very powerful tool for pattern recognition. In this work, we train convolutional neural networks (CNNs) on dynamically simulated diffraction patterns of LiNiO2, an important cathode-active material for lithium-ion batteries, to predict the orientation of grains in terms of three Euler angles for the complete fundamental orientation region. Results demonstrate that these networks outperform the conventional pattern matching algorithm with increased accuracy and efficiency. The increased accuracy of the CNN models can be attributed to the fact that these models are trained by data incorporating dynamical effects. This work is the first attempt to apply deep learning for analysis of electron diffraction data and highlights the great potential of ML to accelerate the analysis of electron microscopy data, toward high-throughput characterization techniques.

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.18350 [pdf, ps, other]

Optical Excitations of Flat Bands Induced by Exciton Condensation in Ta$_2$Pd$_3$Te$_{5}$

Authors: Shaohui Yi, Zhiyu Liao, Chenhao Liang, Sheng Zhang, Xiutong Deng, Yongjie Xie, Lincong Zheng, Yujie Wang, Yubiao Wu, Zhijun Wang, Youguo Shi, Xianggang Qiu, Bing Xu

Abstract: We report on the charge dynamics of Ta$_2$Pd$_3$Te$_5$ using temperature-dependent optical spectroscopy with polarized light. We observe a metal-insulator transition characterized by the collapse of Drude response and the emergence of sharp and narrow absorption peaks at low temperatures. Unlike previous excitonic insulator candidates such as TiSe$_2$ and Ta$_2$NiSe$_5$, where the excitonic order is intertwined with charge density wave or structural instabilities, the sharp features in Ta$_2$Pd$_3$Te$_5$ point to intrinsic excitonic excitations associated with ultra-flat bands driven by many-body renormalization of the band structure via spontaneous exciton condensation. Our findings thus provide clear-cut optical evidence for exciton condensation in a bulk crystal and establish Ta$_2$Pd$_3$Te$_5$ as a promising platform for exploring correlated quantum phases and novel excitonic phenomena.

Submitted 23 June, 2025; originally announced June 2025.

Comments: 7 pages, 3 figures

arXiv:2506.17495 [pdf, ps, other]

Modeling and Inferring Metacommunity Dynamics with Maximum Caliber

Authors: Zachary Jackson, Mathew A. Leibold, Robert D. Holt, BingKan Xue

Abstract: A major challenge for community ecology is to use distribution patterns to infer basic parameters of dynamical models without conducting laborious experimental manipulations. We present a novel framework drawn from statistical physics -- Maximum Caliber -- for characterizing the temporal dynamics of complex ecological systems in spatially extended landscapes and inferring parameters from temporal data. As an extension of Maximum Entropy modeling, Maximum Caliber models the probability of possible trajectories of a stochastic system, rather than focusing on system states. We demonstrate the ability of the Maximum Caliber framework to capture ecological processes ranging from near- to far-from-equilibrium, using an array of species interaction motifs including random interactions, apparent competition, intraguild competition, and non-transitive competition, along with dispersal among multiple patches. For spatio-temporal data of species occurrence in a metacommunity, the parameters of a Maximum Caliber model can be estimated through a simple logistic regression to reveal migration rates between patches, magnitudes of interactions between species, and effects of intrinsic local environmental suitabilities. We test the accuracy of the method over a range of system sizes and time periods, and find that these parameters can be estimated without bias. We introduce entropy production as a system-level measure of disequilibrium, and use "pseudo-$R^2$" to characterize the predictability of the system. We show that our model can predict the dynamics of metacommunities much better than steady state models when the system is far from equilibrium. The capacity to estimate basic parameters of dynamical metacommunity models from spatio-temporal data represents an important breakthrough for the study of metacommunities with application to practical problems in conservation and restoration ecology.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16445 [pdf, ps, other]

StoryWriter: A Multi-Agent Framework for Long Story Generation

Authors: Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li

Abstract: Long story generation remains a challenge for existing large language models (LLMs), primarily due to two main factors: (1) discourse coherence, which requires plot consistency, logical coherence, and completeness in the long-form generation, and (2) narrative complexity, which requires an interwoven and engaging narrative. To address these challenges, we propose StoryWriter, a multi-agent story generation framework, which consists of three main modules: (1) outline agent, which generates event-based outlines containing rich event plots, character, and event-event relationships; (2) planning agent, which further details events and plans which events should be written in each chapter to maintain an interwoven and engaging story; and (3) writing agent, which dynamically compresses the story history based on the current event to generate and reflect new plots, ensuring the coherence of the generated story. We conduct both human and automated evaluation, and StoryWriter significantly outperforms existing story generation baselines in both story quality and length. Furthermore, we use StoryWriter to generate a dataset, which contains about 6,000 high-quality long stories, with an average length of 8,000 words. We train the model Llama3.1-8B and GLM4-9B using supervised fine-tuning on LongStory and develop StoryWriter_GLM and StoryWriter_GLM, which demonstrate advanced performance in long story generation.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16051 [pdf, ps, other]

From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience

Authors: Zhiwei Li, Carl Kesselman, Tran Huy Nguyen, Benjamin Yixing Xu, Kyle Bolo, Kimberley Yu

Abstract: Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.14813 [pdf, ps, other]

Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks

Authors: Yuxuan Jiang, Ziming Zhou, Boyu Xu, Beijie Liu, Runhui Xu, Peng Huang

Abstract: Training deep learning (DL) models is a complex process, making it prone to silent errors that are challenging to detect and diagnose. This paper presents TRAINCHECK, a framework that takes a proactive checking approach to address silent training errors. TRAINCHECK automatically infers invariants tailored for DL training. It uses these invariants to proactively detect silent errors during the training process while providing debugging help. To evaluate TRAINCHECK, we reproduce 20 real-world silent training errors with diverse root causes. TRAINCHECK successfully detects 18 errors within a single training iteration. It also uncovers 6 unknown bugs in popular training libraries that lead to silent errors.

Submitted 6 June, 2025; originally announced June 2025.

Comments: 19 pages, to appear in 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI '25)

arXiv:2506.14406 [pdf, ps, other]

Search for neutron decay into an antineutrino and a neutral kaon in 0.401 megaton-years exposure of Super-Kamiokande

Authors: Super-Kamiokande Collaboration: K. Yamauchi, K. Abe, S. Abe, Y. Asaoka, M. Harada, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, M. Nakahata, S. Nakayama, Y. Noguchi, G. Pronost, K. Sato, H. Sekiya, et al. (240 additional authors not shown)

Abstract: We searched for bound neutron decay via $n\to\barν+K^0$ predicted by the Grand Unified Theories in 0.401 Mton$\cdot$years exposure of all pure water phases in the Super-Kamiokande detector. About 4.4 times more data than in the previous search have been analyzed by a new method including a spectrum fit to kaon invariant mass distributions. No significant data excess has been observed in the signal regions. As a result of this analysis, we set a lower limit of $7.8\times10^{32}$ years on the neutron lifetime at a 90% confidence level.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 12 pages, 5 figures

arXiv:2506.12877 [pdf, ps, other]

Symplectic Spin-Lattice Dynamics with Machine-Learning Potentials

Authors: Zhengtao Huang, Ben Xu

Abstract: Accurate atomic-scale simulations of magnetic materials require precise handling of coupled spin-lattice degrees of freedom. Traditional spin-lattice dynamics (SLD), employing the Newtonian equation for lattice evolution and the Landau-Lifshitz-Gilbert (LLG) equation for spins, encounters severe limitations with machine-learning potentials, including poor energy conservation and excessive computational costs due to non-symplectic integration. In this work, we propose TSPIN, a unified Nosé-Hoover Chain-based method overcoming these issues. By extending the classical Lagrangian with explicit spin kinetic terms and thermostat variables, we derive symplectic Hamiltonian formulations suitable for NVE, NVT, and NPT ensembles. The method integrates spin and lattice dynamics simultaneously, ensuring robust energy conservation and significantly reducing computational cost. Benchmarks against analytical harmonic spin-lattice models confirm its accuracy, and application to FCC iron using a DeepSPIN MLP demonstrates superior numerical stability and near-linear computational scaling compared to the conventional LLG method. Thus, TSPIN provides a powerful, broadly applicable framework for efficiently simulating complex spin-lattice phenomena and multi-degree-of-freedom systems at large scales.

Submitted 15 June, 2025; originally announced June 2025.

arXiv:2506.12446 [pdf, ps, other]

From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment

Authors: Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, Huawei Shen

Abstract: Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these, we propose SP-PRM, a novel dual-consistency framework integrating score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate that SP-PRM substantially enhances existing RGS methods, achieving a 3.6%-10.3% improvement in GPT-4 evaluation scores across all tasks.

Submitted 28 June, 2025; v1 submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.11763 [pdf, ps, other]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Authors: Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao

Abstract: Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports, compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The other framework is introduced to evaluate a DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.

Submitted 13 June, 2025; originally announced June 2025.

Comments: 31 pages, 5 figures

arXiv:2506.09942 [pdf, ps, other]

VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at https://github.com/THU-KEG/VerIF.

Submitted 11 June, 2025; originally announced June 2025.

Comments: 16 pages, 8 figures

arXiv:2506.09002 [pdf, ps, other]

Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models

Authors: Bei Chu, Yang Feng, Kui Liu, Hange Shi, Zifan Nan, Zhaoqiang Guo, Baowen Xu

Abstract: Unit testing is essential for ensuring software reliability and correctness. Classic Search-Based Software Testing (SBST) methods and concolic execution-based approaches for generating unit tests often fail to achieve high coverage due to difficulties in handling complex program units, such as branching conditions and external dependencies. Recent work has increasingly utilized large language models (LLMs) to generate test cases, improving the quality of test generation by providing better context and correcting errors in the model's output. However, these methods rely on fixed prompts, resulting in relatively low compilation success rates and coverage. This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests. PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints. These constraints and relevant contextual information are used to construct prompts that guide the LLMs in generating unit tests. We implement the approach and evaluate it on 10 open-source Rust crates. Experimental results show that within just two or three hours, PALM can significantly improve test coverage compared to classic methods, with increases in overall project coverage exceeding 50% in some instances and its generated tests achieving an average coverage of 75.77%, comparable to human effort (71.30%), highlighting the potential of LLMs in automated test generation. We submitted 91 PALM-generated unit tests targeting new code. Of these submissions, 80 were accepted, 5 were rejected, and 6 remain pending review. The results demonstrate the effectiveness of integrating program analysis with AI and open new avenues for future research in automated software testing.

Submitted 10 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: 10 pages, 5 figures

arXiv:2506.07831 [pdf, other]

Clock Synchronization for Drone-Based Entanglement Quantum Key Distribution

Authors: Jinquan Huang, Bangying Tang, Hui Han, JianJi Yi, Bo Xu, Chunqing Wu, Xiangwei Zhu, Wanrong Yu, Huicun Yu, Jiahao Li, Shihai Sun, Bo Liu

Abstract: Drone-based entanglement distribution provides full spatiotemporal coverage for quantum networks, enabling quantum key distribution (QKD) in dynamic environments. The security of QKD fundamentally depends on high-fidelity quantum state measurements, for which high-precision clock synchronization is indispensable, as timing jitter is inversely correlated with quantum state fidelity. However, drone-based clock synchronization is constrained by SWaP (Size, Weight, and Power) limitations and dynamic mobility effects. Here, we propose a synchronization protocol for drone-based entanglement distribution, leveraging nanosecond-accurate Global Navigation Satellite System (GNSS) timing and entanglement-based timing correction to overcome SWaP constraints. Experimental results demonstrate 24 ps RMS synchronization in simulated free-space quantum channels with distance dynamics, without requiring a precision reference clock. Our protocol enables drone-based entanglement distribution, paving the way for seamless wide-area and local-area quantum internet.

Submitted 9 June, 2025; originally announced June 2025.

Comments: 16 pages, 6 figures

arXiv:2506.07779 [pdf, ps, other]

Design and Evaluation of Deep Learning-Based Dual-Spectrum Image Fusion Methods

Authors: Beining Xu, Junxian Li

Abstract: Visible images offer rich texture details, while infrared images emphasize salient targets. Fusing these complementary modalities enhances scene understanding, particularly for advanced vision tasks under challenging conditions. Recently, deep learning-based fusion methods have gained attention, but current evaluations primarily rely on general-purpose metrics without standardized benchmarks or downstream task performance. Additionally, the lack of well-developed dual-spectrum datasets and fair algorithm comparisons hinders progress. To address these gaps, we construct a high-quality dual-spectrum dataset captured in campus environments, comprising 1,369 well-aligned visible-infrared image pairs across four representative scenarios: daytime, nighttime, smoke occlusion, and underpasses. We also propose a comprehensive and fair evaluation framework that integrates fusion speed, general metrics, and object detection performance using the lang-segment-anything model to ensure fairness in downstream evaluation. Extensive experiments benchmark several state-of-the-art fusion algorithms under this framework. Results demonstrate that fusion models optimized for downstream tasks achieve superior performance in target detection, especially in low-light and occluded scenes. Notably, some algorithms that perform well on general metrics do not translate to strong downstream performance, highlighting limitations of current evaluation practices and validating the necessity of our proposed framework. The main contributions of this work are: (1) a campus-oriented dual-spectrum dataset with diverse and challenging scenes; (2) a task-aware, comprehensive evaluation framework; and (3) thorough comparative analysis of leading fusion methods across multiple datasets, offering insights for future development.

Submitted 9 June, 2025; originally announced June 2025.

Comments: 11 pages, 13 figures

arXiv:2506.07599 [pdf, ps, other]

Flexible MIMO for Future Wireless Communications: Which Flexibilities are Possible?

Authors: Zhe Wang, Jiayi Zhang, Bokai Xu, Wenhui Yi, Emil Björnson, Bo Ai

Abstract: To enable next-generation wireless communication networks with modest spectrum availability, multiple-input multiple-output (MIMO) technology needs to undergo further evolution. In this paper, we introduce a promising next-generation wireless communication concept: flexible MIMO technology. This technology represents a MIMO technology with flexible physical configurations and integrated applications. We categorize twelve representative flexible MIMO technologies into three major classifications: flexible deployment characteristics-based, flexible geometry characteristics-based, and flexible real-time modifications-based. Then, we provide a comprehensive overview of their fundamental characteristics, potential, and challenges. Furthermore, we demonstrate three vital enablers for the flexible MIMO technology, including efficient channel state information (CSI) acquisition schemes, low-complexity beamforming design, and explainable artificial intelligence (AI)-enabled optimization. Within these areas, eight critical sub-enabling technologies are discussed in detail. Finally, we present two case studies (pre-optimized irregular arrays and cell-free movable antennas) where significant potential for flexible MIMO technologies to enhance the system capacity is showcased.

Submitted 9 June, 2025; originally announced June 2025.

Comments: 9 pages, 5 figures, 1 table

arXiv:2506.07050 [pdf, ps, other]

doi 10.1145/3711896.3736971

From Swath to Full-Disc: Advancing Precipitation Retrieval with Multimodal Knowledge Expansion

Authors: Zheng Wang, Kai Ying, Bin Xu, Chunjiao Wang, Cong Bai

Abstract: Accurate near-real-time precipitation retrieval has been enhanced by satellite-based technologies. However, infrared-based algorithms have low accuracy due to weak relations with surface precipitation, whereas passive microwave and radar-based methods are more accurate but limited in range. This challenge motivates the Precipitation Retrieval Expansion (PRE) task, which aims to enable accurate, infrared-based full-disc precipitation retrievals beyond the scanning swath. We introduce Multimodal Knowledge Expansion, a two-stage pipeline with the proposed PRE-Net model. In the Swath-Distilling stage, PRE-Net transfers knowledge from a multimodal data integration model to an infrared-based model within the scanning swath via Coordinated Masking and Wavelet Enhancement (CoMWE). In the Full-Disc Adaptation stage, Self-MaskTune refines predictions across the full disc by balancing multimodal and full-disc infrared knowledge. Experiments on the introduced PRE benchmark demonstrate that PRE-Net significantly advanced precipitation retrieval performance, outperforming leading products like PERSIANN-CCS, PDIR, and IMERG. The code will be available at https://github.com/Zjut-MultimediaPlus/PRE-Net.

Submitted 8 June, 2025; originally announced June 2025.

arXiv:2506.06881 [pdf, other]

KnowCoder-V2: Deep Knowledge Analysis

Authors: Zixuan Li, Wenxuan Liu, Long Bai, Chunmao Zhang, Wei Li, Fenghui Zhang, Quanxin Jin, Ruoyun He, Zhuo Chen, Zhilei Hu, Fei Wang, Bingbing Xu, Xuhui Jiang, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng

Abstract: Deep knowledge analysis tasks always involve the systematic extraction and association of knowledge from large volumes of data, followed by logical reasoning to discover insights. However, to solve such complex tasks, existing deep research frameworks face three major challenges: 1) They lack systematic organization and management of knowledge; 2) They operate purely online, making them inefficient for tasks that rely on shared and large-scale knowledge; 3) They cannot perform complex knowledge computation, limiting their ability to produce insightful analytical results. Motivated by these challenges, in this paper, we propose a Knowledgeable Deep Research (KDR) framework that empowers deep research with deep knowledge analysis capability. Specifically, it introduces an independent knowledge organization phase to preprocess large-scale, domain-relevant data into systematic knowledge offline. Based on this knowledge, it extends deep research with an additional kind of reasoning step that performs complex knowledge computation in an online manner. To enhance the abilities of LLMs to solve knowledge analysis tasks in the above framework, we further introduce KnowCoder-V2, an LLM that bridges knowledge organization and reasoning via unified code generation. For knowledge organization, it generates instantiation code for predefined classes, transforming data into knowledge objects. For knowledge computation, it generates analysis code and executes it on the above knowledge objects to obtain deep analysis results.
Experimental results on more than thirty datasets across six knowledge analysis tasks demonstrate the effectiveness of KnowCoder-V2. Moreover, when integrated into the KDR framework, KnowCoder-V2 can generate high-quality reports with insightful analytical results compared to the mainstream deep research framework.

Submitted 7 June, 2025; originally announced June 2025.

arXiv:2506.06679 [pdf, ps, other]

Controlled Reach-avoid Set Computation for Discrete-time Polynomial Systems via Convex Optimization

Authors: Taoran Wu, Yiling Xue, Dejin Ren, Arvind Easwaran, Martin Fränzle, Bai Xue

Abstract: This paper addresses the computation of controlled reach-avoid sets (CRASs) for discrete-time polynomial systems subject to control inputs. A CRAS is a set encompassing initial states from which there exist control inputs driving the system into a target set while avoiding unsafe sets. However, efficiently computing CRASs remains an open problem, especially for discrete-time systems. In this paper, we propose a novel framework for computing CRASs which takes advantage of a probabilistic perspective. This framework transforms the fundamentally nonlinear problem of computing CRASs into a computationally tractable convex optimization problem. By regarding control inputs as disturbances obeying certain probability distributions, a CRAS can be equivalently treated as a 0-reach-avoid set in the probabilistic sense, which consists of initial states from which the probability of eventually entering the target set while remaining within the safe set is greater than zero. Thus, we can employ the convex optimization method of computing 0-reach-avoid sets to estimate CRASs. Furthermore, inspired by the ε-greedy strategy widely used in reinforcement learning, we propose an approach that iteratively updates the aforementioned probability distributions imposed on control inputs to compute larger CRASs. We demonstrate the effectiveness of the proposed method on extensive examples.

Submitted 7 June, 2025; originally announced June 2025.

arXiv:2506.06481 [pdf, ps, other]

Ordering curves on surfaces

Authors: Hugo Parlier, Hanh Vo, Binbin Xu

Abstract: We study the order of lengths of closed geodesics on hyperbolic surfaces. Our first main result is that the order of lengths of curves determines a point in Teichmüller space. In an opposite direction, we identify classes of curves whose order never changes, independently of the choice of hyperbolic metric. We use this result to identify short curves with small intersections on pairs of pants.

Submitted 6 June, 2025; originally announced June 2025.

Comments: 27 pages, 8 figures

arXiv:2506.06392 [pdf]

Additive Manufacturing of Lunar Regolith for Reconfigurable Building Blocks toward Lunar Habitation

Authors: Cole McCallum, Youwen Liang, Nahid Tushar, Ben Xu, Bo Zhao, Hao Zeng, Wan Shou

Abstract: Utilizing locally available materials is a crucial step towards sustainable planetary habitation. Lunar regolith has gained tremendous interest in additive manufacturing in the past decades. However, due to the constrained manufacturing facilities and materials on the moon, many existing additive manufacturing methods are not suitable for practical on-site manufacturing. Here, we envision that light-based direct sintering of lunar regolith can be a feasible approach. Instead of directly manufacturing large structures, we hypothesize that small-scale, reconfigurable building blocks can be an alternative to form large and complex structures. To verify the feasibility, we conducted laser sintering of lunar regolith simulants as a proof of concept, following a simple theoretical calculation for direct sintering using the light available in space. Different laser processing parameters are investigated to obtain controllable lunar regolith sintering. We further designed Lego-like interlocking bricks that are reconfigurable for different structure assemblies without additional material. Mechanical performance (compressive strength) of sintered cubic blocks is evaluated, showing a peak stress of ~1.5 MPa. We hope this work will inspire other in-space manufacturing techniques and enable low-cost space habitation.

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.05044 [pdf, ps, other]

Rethinking Contrastive Learning in Session-based Recommendation

Authors: Xiaokun Zhang, Bo Xu, Fenglong Ma, Zhizheng Wang, Liang Yang, Hongfei Lin

Abstract: Session-based recommendation aims to predict the intents of anonymous users based on limited behaviors. With its ability to alleviate data sparsity, contrastive learning is prevalent in this task. However, we observe that existing contrastive-learning-based methods still suffer from three obstacles: (1) they overlook item-level sparsity and primarily focus on session-level sparsity; (2) they typically augment sessions using item-ID operations like crop, mask and reorder, failing to ensure the semantic consistency of augmented views; (3) they treat all positive-negative signals equally, without considering their varying utility. To this end, we propose a novel multi-modal adaptive contrastive learning framework called MACL for session-based recommendation. In MACL, a multi-modal augmentation is devised to generate semantically consistent views at both item and session levels by leveraging item multi-modal features. Besides, we present an adaptive contrastive loss that distinguishes varying contributions of positive-negative signals to improve self-supervised learning. Extensive experiments on three real-world datasets demonstrate the superiority of MACL over state-of-the-art methods.

Submitted 5 June, 2025; originally announced June 2025.

Comments: This work has been accepted by Pattern Recognition

arXiv:2506.04699 [pdf, ps, other]

Empowering Economic Simulation for Massively Multiplayer Online Games through Generative Agent-Based Modeling

Authors: Bihan Xu, Shiwei Zhao, Runze Wu, Zhenya Huang, Jiawei Wang, Zhipeng Hu, Kai Wang, Haoyu Liu, Tangjie Lv, Le Li, Changjie Fan, Xin Tong, Jiangze Han

Abstract: Within the domain of Massively Multiplayer Online (MMO) economy research, Agent-Based Modeling (ABM) has emerged as a robust tool for analyzing game economics, evolving from rule-based agents to decision-making agents enhanced by reinforcement learning. Nevertheless, existing works encounter significant challenges when attempting to emulate human-like economic activities among agents, particularly regarding agent reliability, sociability, and interpretability. In this study, we take a preliminary step in introducing a novel approach using Large Language Models (LLMs) in MMO economy simulation. Leveraging LLMs' role-playing proficiency, generative capacity, and reasoning aptitude, we design LLM-driven agents with human-like decision-making and adaptability. These agents are equipped with the abilities of role-playing, perception, memory, and reasoning, addressing the aforementioned challenges effectively. Simulation experiments focusing on in-game economic activities demonstrate that LLM-empowered agents can promote emergent phenomena like role specialization and price fluctuations in line with market rules.

Submitted 5 June, 2025; originally announced June 2025.

Comments: KDD2025 Accepted

arXiv:2506.04280 [pdf, ps, other]

Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

Authors: Ziming Cheng, Binrui Xu, Lisheng Gong, Zuhe Song, Tianshuo Zhou, Shiqi Zhong, Siyu Ren, Mingxiang Chen, Xiangchao Meng, Yuxin Zhang, Yanlin Li, Lei Ren, Wei Chen, Zhiyuan Huang, Mingjie Zhan, Xiaojie Wang, Fangxiang Feng

Abstract: With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on 40 MLLMs, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

Submitted 4 June, 2025; originally announced June 2025.

Comments: 18 pages

MSC Class: 68T50; ACM Class: I.2.7

arXiv:2506.03968 [pdf, ps, other]

From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

Authors: Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao

Abstract: The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models and codes will be available at https://github.com/Ignoramus0817/SynthQuestions.

Submitted 4 June, 2025; originally announced June 2025.

Comments: To be published at ACL 2025

arXiv:2506.02244 [pdf, ps, other]

Motion aware video generative model

Authors: Bowen Xue, Giuseppe Claudio Guarnera, Shuang Zhao, Zahra Montazeri

Abstract: Recent advances in diffusion-based video generation have yielded unprecedented quality in visual content and semantic coherence. However, current approaches predominantly rely on statistical learning from vast datasets without explicitly modeling the underlying physics of motion, resulting in subtle yet perceptible non-physical artifacts that diminish the realism of generated videos. This paper introduces a physics-informed frequency domain approach to enhance the physical plausibility of generated videos. We first conduct a systematic analysis of the frequency-domain characteristics of diverse physical motions (translation, rotation, scaling), revealing that each motion type exhibits distinctive and identifiable spectral signatures. Building on this theoretical foundation, we propose two complementary components: (1) a physical motion loss function that quantifies and optimizes the conformity of generated videos to ideal frequency-domain motion patterns, and (2) a frequency domain enhancement module that progressively learns to adjust video features to conform to physical motion constraints while preserving original network functionality through a zero-initialization strategy. Experiments across multiple video diffusion architectures demonstrate that our approach significantly enhances motion quality and physical plausibility without compromising visual quality or semantic alignment.
Our frequency-domain physical motion framework generalizes effectively across different video generation architectures, offering a principled approach to incorporating physical constraints into deep learning-based video synthesis pipelines. This work seeks to establish connections between data-driven models and physics-based motion models.

Submitted 2 June, 2025; originally announced June 2025.

arXiv:2506.01048 [pdf, ps, other]

IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory

Authors: Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, Runze Wu

Abstract: Large language models (LLMs) have demonstrated exceptional performance across a wide range of natural language tasks. However, selecting the optimal LLM to respond to a user query often necessitates a delicate balance between performance and cost. While powerful models deliver better results, they come at a high cost, whereas smaller models are more cost-effective but less capable. To address this trade-off, we propose IRT-Router, a multi-LLM routing framework that efficiently routes user queries to the most suitable LLM. Inspired by Item Response Theory (IRT), a psychological measurement methodology, IRT-Router explicitly models the relationship between LLM capabilities and user query attributes. This not only enables accurate prediction of response performance but also provides interpretable insights, such as LLM abilities and query difficulty. Additionally, we design an online query warm-up technique based on semantic similarity, further enhancing the online generalization capability of IRT-Router. Extensive experiments on 20 LLMs and 12 datasets demonstrate that IRT-Router outperforms most baseline methods in terms of effectiveness and interpretability. Its superior performance in cold-start scenarios further confirms the reliability and practicality of IRT-Router in real-world applications. Code is available at https://github.com/Mercidaiha/IRT-Router.

Submitted 20 June, 2025; v1 submitted 1 June, 2025; originally announced June 2025.

Comments: ACL 2025 Main
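
Note: the routing idea summarized in the abstract above can be illustrated with a textbook two-parameter logistic (2PL) IRT model. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the function names, the scalar ability and difficulty values, the normalized cost field, and the linear performance-minus-cost score are all hypothetical choices made for this example.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """Standard 2PL IRT curve: chance that a model with the given
    ability answers a query of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def route(query_difficulty, models, cost_weight=0.5):
    """Return the name of the model with the best predicted
    performance minus weighted cost (a hypothetical scoring rule)."""
    def score(model):
        return p_correct(model["ability"], query_difficulty) - cost_weight * model["cost"]
    return max(models, key=score)["name"]

# Hypothetical model pool; abilities and costs are made-up numbers.
MODELS = [
    {"name": "small", "ability": 0.0, "cost": 0.1},
    {"name": "large", "ability": 2.0, "cost": 0.6},
]

if __name__ == "__main__":
    print(route(-2.0, MODELS))  # easy query
    print(route(2.0, MODELS))   # hard query
```

With these made-up numbers, an easy query goes to the cheaper model and a hard query goes to the more capable one, which is the performance-versus-cost trade-off the abstract describes.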

arXiv:2506.00886 [pdf, ps, other]

Toward a Theory of Agents as Tool-Use Decision-Makers

Authors: Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Kam-Fai Wong

Abstract: As Large Language Models (LLMs) evolve into increasingly autonomous agents, fundamental questions about their epistemic foundations remain unresolved: What defines an agent? How should it make decisions? And what objectives should guide its behavior? In this position paper, we argue that true autonomy requires agents to be grounded in a coherent epistemic framework that governs what they know, what they need to know, and how to acquire that knowledge efficiently. We propose a unified theory that treats internal reasoning and external actions as equivalent epistemic tools, enabling agents to systematically coordinate introspection and interaction. Building on this framework, we advocate for aligning an agent's tool use decision-making boundary with its knowledge boundary, thereby minimizing unnecessary tool use and maximizing epistemic efficiency. This perspective shifts the design of agents from mere action executors to knowledge-driven intelligence systems, offering a principled path toward building foundation agents capable of adaptive, efficient, and goal-directed behavior.

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2506.00388 [pdf, ps, other]

CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries

Authors: Ni Mu, Hao Hu, Xiao Hu, Yiqin Yang, Bo Xu, Qing-Shan Jia

Abstract: Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL's real-world applicability. To address this, we propose an offline PbRL method: Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart, thus facilitating the selection of more unambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines in both non-ideal teachers and real human feedback settings. Our approach not only selects more distinguished queries but also learns meaningful trajectory embeddings.

Submitted 10 June, 2025; v1 submitted 31 May, 2025; originally announced June 2025.

Comments: ICML 2025

arXiv:2505.24710 [pdf, ps, other]

Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting

Authors: Wei Chen, Jiahao Zhang, Haipeng Zhu, Boyan Xu, Zhifeng Hao, Keli Zhang, Junjian Ye, Ruichu Cai

Abstract: Large language models (LLMs) have shown great potential in decision-making due to the vast amount of knowledge stored within the models. However, these pre-trained models often lack reasoning abilities and are difficult to adapt to new environments, further hindering their application to complex real-world tasks. To address these challenges, inspired by the human cognitive process, we propose Causal-aware LLMs, which integrate the structural causal model (SCM) into the decision-making process to model, update, and utilize structured knowledge of the environment in a "learning-adapting-acting" paradigm. Specifically, in the learning stage, we first utilize an LLM to extract the environment-specific causal entities and their causal relations to initialize a structured causal model of the environment. Subsequently, in the adapting stage, we update the structured causal model through external feedback about the environment, via an idea of causal intervention. Finally, in the acting stage, Causal-aware LLMs exploit structured causal knowledge for more efficient policy-making through the reinforcement learning agent. The above processes are performed iteratively to learn causal knowledge, ultimately enabling the causal-aware LLMs to achieve a more accurate understanding of the environment and make more efficient decisions. Experimental results across 22 diverse tasks within the open-world game "Crafter" validate the effectiveness of our proposed method.

Submitted 30 May, 2025; originally announced May 2025.

Comments: Accepted by IJCAI 2025

arXiv:2505.24147 [pdf, other]

Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability

Authors: Chiwei Zhu, Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Zhendong Mao

Abstract: Training language models with rationales augmentation has been shown to be beneficial in many existing works. In this paper, we identify that such a prevailing view does not hold consistently. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance as well as a novel perspective of model reliability. The results lead to several key findings that add new insights to existing understanding: 1) Rationales can, at times, deteriorate model performance; 2) Rationales can, at times, improve model reliability, even outperforming their untrained counterparts; 3) A linear correspondence exists between the performance and reliability improvements, while both are driven by the intrinsic difficulty of the task. These findings provide informative regulations on the broad utilization of rationales and raise critical implications on the procedure of explicitly aligning language models with implicit human thoughts. Code can be found at https://github.com/Ignoramus0817/rationales.

Submitted 29 May, 2025; originally announced May 2025.

Comments: To be published in ACL 2025 Findings. (Work originally done in Jan 2024)

arXiv:2505.24123 [pdf, ps, other]

Meta-heuristic Hypergraph-Assisted Robustness Optimization for Higher-order Complex Systems

Authors: Xilong Qu, Wenbin Pei, Haifang Li, Qiang Zhang, Bing Xue, Mengjie Zhang

Abstract: In complex systems (e.g., communication, transportation, and biological networks), high robustness ensures sustained functionality and stability even when resisting attacks. However, the inherent structure complexity and the unpredictability of attacks make robustness optimization challenging. Hypergraphs provide a framework for modeling complicated higher-order interactions in complex systems naturally, but their potential has not been systematically investigated. Therefore, we propose an effective method based on genetic algorithms from Artificial Intelligence to optimize the robustness of complex systems modeled by hypergraphs. By integrating percolation-based metrics with adaptive computational techniques, our method achieves improved accuracy and efficiency. Experiments on both synthetic and real-world hypergraphs demonstrate the effectiveness of the proposed method in mitigating malicious attacks, with robustness improvements ranging from 16.6% to 205.2%. Further in-depth analysis reveals that optimized hypergraph-based systems exhibit a preferential connection mechanism in which high-hyperdegree nodes preferentially connect to lower-cardinality hyperedges, forming a distinctive Lotus topology that significantly improves robustness. Based on this finding, we propose a robust hypergraph generation method that allows robustness to be controlled via a single parameter rb. Notably, for rb<-1, a distinct Cactus topology emerges as an alternative to the Lotus topology observed for rb>1.
The discovery of the Lotus and Cactus topologies offers valuable insights for designing robust higher-order networks while providing a useful foundation for investigating cascading failure dynamics in complex systems.

Submitted 12 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.22591 [pdf, ps, other]

Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning

Authors: Erxin Yu, Jing Li, Ming Liao, Qi Zhu, Boyang Xue, Minghui Xu, Baojun Wang, Lanqing Hong, Fei Mi, Lifeng Shang

Abstract: Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model's (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs' mathematical reasoning through error generalization.

Submitted 28 May, 2025; originally announced May 2025.

Comments: 16 pages, 9 figures

arXiv:2505.21901 [pdf, ps, other]

Symbolically Regressing Fish Biomass Spectral Data: A Linear Genetic Programming Method with Tunable Primitives

Authors: Zhixing Huang, Bing Xue, Mengjie Zhang, Jeremy S. Ronney, Keith C. Gordon, Daniel P. Killeen

Abstract: Machine learning techniques play an important role in analyzing spectral data. The spectral data of fish biomass is useful in fish production, as it carries many important chemical properties of fish meat. However, it is challenging for existing machine learning techniques to comprehensively discover hidden patterns in fish biomass spectral data, since the data are often noisy while the training data are quite limited. To better analyze fish biomass spectral data, this paper models the task as a symbolic regression problem and solves it with a linear genetic programming method using newly proposed tunable primitives. In the symbolic regression problem, linear genetic programming automatically synthesizes regression models based on the given primitives and training data. The tunable primitives further improve the approximation ability of the regression models by tuning their inherent coefficients. Our empirical results over ten fish biomass targets show that the proposed method improves the overall performance of fish biomass composition prediction. The synthesized regression models are compact and have good interpretability, which allows us to highlight useful features over the spectrum. Our further investigation also verifies the good generality of the proposed method across various spectral data treatments and other symbolic regression problems.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.20884 [pdf]

YOLO-FireAD: Efficient Fire Detection via Attention-Guided Inverted Residual Learning and Dual-Pooling Feature Preservation

Authors: Weichao Pan, Bohan Xu, Xu Wang, Chengze Lv, Shuoyang Wang, Zhenke Duan

Abstract: Fire detection in dynamic environments faces persistent challenges, including interference from illumination changes, frequent false or missed detections, and the difficulty of achieving both efficiency and accuracy. To address the feature extraction limitations and information loss in existing YOLO-based models, this study proposes You Only Look Once for Fire Detection with Attention-guided Inverted Residual and Dual-pooling Downscale Fusion (YOLO-FireAD) with two core innovations: (1) the Attention-guided Inverted Residual Block (AIR) integrates hybrid channel-spatial attention with inverted residuals to adaptively enhance fire features and suppress environmental noise; (2) the Dual Pool Downscale Fusion Block (DPDF) preserves multi-scale fire patterns through learnable fusion of max-average pooling outputs, mitigating small-fire detection failures. Extensive evaluation on two public datasets demonstrates the model's efficiency. The proposed model keeps the parameter count (1.45M, 51.8% lower than YOLOv8n) and computational cost (4.6G, 43.2% lower than YOLOv8n) low, and its mAP75 is 1.3-5.5% higher than that of the mainstream real-time object detection models YOLOv8n, YOLOv9t, YOLOv10n, YOLO11n, YOLOv12n and other YOLOv8 variants.

Submitted 27 May, 2025; originally announced May 2025.

Showing 1–50 of 1,738 results for author: Xu, B