-
EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty
Authors:
Yuchen Tian,
Ruiyuan Huang,
Xuanwu Wang,
Jing Ma,
Zengfeng Huang,
Ziyang Luo,
Hongzhan Lin,
Da Zheng,
Lun Du
Abstract:
Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To address this limitation, we introduce a novel data augmentation pipeline designed to enhance model robustness from two perspectives: symmetry and difficulty. From the symmetry perspective, we propose two…
▽ More
Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To address this limitation, we introduce a novel data augmentation pipeline designed to enhance model robustness from two perspectives: symmetry and difficulty. From the symmetry perspective, we propose two complementary methods: EvolAST, an Abstract Syntax Tree (AST) based approach that targets syntactic symmetry to generate semantically equivalent problem variants, and EvolDomain, which leverages LLMs to address semantic symmetry by translating theorems across mathematical domains. From the difficulty perspective, we propose EvolDifficulty, which uses carefully designed evolutionary instructions to guide LLMs in generating new theorems with a wider range of difficulty. We then use the evolved data to train EvolProver, a 7B-parameter non-reasoning theorem prover. EvolProver establishes a new state-of-the-art (SOTA) on FormalMATH-Lite with a 53.8% pass@32 rate, surpassing all models of comparable size, including reasoning-based models. It also sets new SOTA records for non-reasoning models on MiniF2F-Test (69.8% pass@32), Ineq-Comp-Seed (52.2% pass@32), and Ineq-Comp-Transformed (34.0% pass@32). Ablation studies further confirm our data augmentation pipeline's effectiveness across multiple benchmarks.
△ Less
Submitted 1 October, 2025;
originally announced October 2025.
-
DPsurv: Dual-Prototype Evidential Fusion for Uncertainty-Aware and Interpretable Whole-Slide Image Survival Prediction
Authors:
Yucheng Xing,
Ling Huang,
Jingying Ma,
Ruping Hong,
Jiangdong Qiu,
Pei Liu,
Kai He,
Huazhu Fu,
Mengling Feng
Abstract:
Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty i…
▽ More
Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, while enabling interpretation of predictions through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation. Experiments on five publicly available datasets achieve the highest mean concordance index and the lowest mean integrated Brier score, validating the effectiveness and reliability of DPsurv. The interpretation of prediction results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.
△ Less
Submitted 28 September, 2025;
originally announced October 2025.
-
Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models
Authors:
Junjie Li,
Ziao Wang,
Jianghong Ma,
Xiaofeng Zhang
Abstract:
Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the budget of instruction tuning dataset often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a fra…
▽ More
Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the budget of instruction tuning dataset often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principle paradigm for instruction data curation.
△ Less
Submitted 26 September, 2025;
originally announced October 2025.
-
Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation
Authors:
Zitong Bo,
Yue Hu,
Jinming Ma,
Mingliang Zhou,
Junhui Yin,
Yachen Kang,
Yuqi Liu,
Tong Wu,
Diyun Xiang,
Hao Chen
Abstract:
Enabling robots to execute long-horizon manipulation tasks from free-form language instructions remains a fundamental challenge in embodied AI. While vision-language models (VLMs) have shown promise as high-level planners, their deployment in the real world is hindered by two gaps: (i) the scarcity of large-scale, sequential manipulation data that couples natural language with multi-step action pl…
▽ More
Enabling robots to execute long-horizon manipulation tasks from free-form language instructions remains a fundamental challenge in embodied AI. While vision-language models (VLMs) have shown promise as high-level planners, their deployment in the real world is hindered by two gaps: (i) the scarcity of large-scale, sequential manipulation data that couples natural language with multi-step action plans, and (ii) the absence of dense, interpretable rewards for fine-tuning VLMs on planning objectives. To address these issues, we propose REVER, a framework that empowers VLMs to generate and validate long-horizon manipulation plans from natural language instructions in real-world scenarios. Under REVER we train and release RoboFarseer, a VLM incentivized to emit chain-of-thought that perform temporal and spatial reasoning, ensuring physically plausible and logically coherent plans. To obtain training data, we leverage the Universal Manipulation Interface framework to capture hardware-agnostic demonstrations of atomic skills. An automated annotation engine converts each demonstration into vision-instruction-plan triplet. We introduce a verifiable reward that scores the generated plan by its ordered bipartite matching overlap with the ground-truth skill sequence. At run time, the fine-tuned VLM functions both as a planner and as a monitor, verifying step-wise completion. RoboFarseer matches or exceeds the performance of proprietary models that are orders of magnitude larger, while on open-ended planning it surpasses the best baseline by more than 40%. In real-world, long-horizon tasks, the complete system boosts overall success by roughly 60% compared with the same low-level controller without the planner. We will open-source both the dataset and the trained model upon publication.
△ Less
Submitted 30 September, 2025;
originally announced September 2025.
-
Less is More: Towards Simple Graph Contrastive Learning
Authors:
Yanan Zhao,
Feng Ji,
Jingyang Dai,
Jiaze Ma,
Wee Peng Tay
Abstract:
Graph Contrastive Learning (GCL) has shown strong promise for unsupervised graph representation learning, yet its effectiveness on heterophilic graphs, where connected nodes often belong to different classes, remains limited. Most existing methods rely on complex augmentation schemes, intricate encoders, or negative sampling, which raises the question of whether such complexity is truly necessary…
▽ More
Graph Contrastive Learning (GCL) has shown strong promise for unsupervised graph representation learning, yet its effectiveness on heterophilic graphs, where connected nodes often belong to different classes, remains limited. Most existing methods rely on complex augmentation schemes, intricate encoders, or negative sampling, which raises the question of whether such complexity is truly necessary in this challenging setting. In this work, we revisit the foundations of supervised and unsupervised learning on graphs and uncover a simple yet effective principle for GCL: mitigating node feature noise by aggregating it with structural features derived from the graph topology. This observation suggests that the original node features and the graph structure naturally provide two complementary views for contrastive learning. Building on this insight, we propose an embarrassingly simple GCL model that uses a GCN encoder to capture structural features and an MLP encoder to isolate node feature noise. Our design requires neither data augmentation nor negative sampling, yet achieves state-of-the-art results on heterophilic benchmarks with minimal computational and memory overhead, while also offering advantages in homophilic graphs in terms of complexity, scalability, and robustness. We provide theoretical justification for our approach and validate its effectiveness through extensive experiments, including robustness evaluations against both black-box and white-box adversarial attacks.
△ Less
Submitted 29 September, 2025;
originally announced September 2025.
-
AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning
Authors:
Xiping Li,
Jianghong Ma
Abstract:
Multimodal Chain-of-Thought (CoT) has emerged as a powerful technique for enhancing the vision-language reasoning with interleaved information. However, existing methods often rely on simplistic heuristics for constructing interleaved CoT, typically depending on attention maps, which our empirical analysis reveals can be unreliable. What's more, the shortcomings of their passive and purposeless se…
▽ More
Multimodal Chain-of-Thought (CoT) has emerged as a powerful technique for enhancing the vision-language reasoning with interleaved information. However, existing methods often rely on simplistic heuristics for constructing interleaved CoT, typically depending on attention maps, which our empirical analysis reveals can be unreliable. What's more, the shortcomings of their passive and purposeless selection strategies and their arbitrary triggering mechanisms in capturing the model's cognitive need for information are further amplified. In this paper, we propose \textbf{AIMCoT}, an \textbf{A}ctive \textbf{I}nformation-driven \textbf{M}ulti-modal \textbf{C}hain-\textbf{o}f-\textbf{T}hought framework that addresses these fundamental limitations. AIMCoT introduces three synergistic components: (1) \textbf{Context-enhanced Attention-map Generation (CAG)}, which mitigates the text-vision granularity imbalance, thereby producing more reliable attention maps as a foundation. (2) \textbf{Active Visual Probing (AVP)}, which replaces passive selection with a proactive, goal-oriented strategy grounded in information theory to select image regions that help answer the questions maximally. (3) \textbf{Dynamic Attention-shifting Trigger (DAT)}, which intelligently determines the optimal moments to insert visual information by monitoring the model's text-to-vision attention shifts. Extensive experiments on three challenging benchmarks demonstrate that AIMCoT significantly outperforms state-of-the-art methods across different settings. By actively foraging for information and dynamically structuring its reasoning process, AIMCoT represents a critical step towards more robust, effective, and human-like multimodal reasoning. Our code is available at https://anonymous.4open.science/r/AIMCoT.
△ Less
Submitted 29 September, 2025;
originally announced September 2025.
-
RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale
Authors:
Jason Holmes,
Yuexing Hao,
Mariana Borras-Osorio,
Federico Mastroleo,
Santiago Romero Brufau,
Valentina Carducci,
Katie M Van Abel,
David M Routman,
Andrew Y. K. Foong,
Liv M Muller,
Satomi Shiraishi,
Daniel K Ebner,
Daniel J Ma,
Sameer R Keole,
Samir H Patel,
Mirek Fatyga,
Martin Bues,
Brad J Stish,
Yolanda I Garces,
Michelle A Neben Wittich,
Robert L Foote,
Sujay A Vora,
Nadia N Laack,
Mark R Waddle,
Wei Liu
Abstract:
Manual labeling limits the scale, accuracy, and timeliness of patient outcomes research in radiation oncology. We present RadOnc-GPT, an autonomous large language model (LLM)-based agent capable of independently retrieving patient-specific information, iteratively assessing evidence, and returning structured outcomes. Our evaluation explicitly validates RadOnc-GPT across two clearly defined tiers…
▽ More
Manual labeling limits the scale, accuracy, and timeliness of patient outcomes research in radiation oncology. We present RadOnc-GPT, an autonomous large language model (LLM)-based agent capable of independently retrieving patient-specific information, iteratively assessing evidence, and returning structured outcomes. Our evaluation explicitly validates RadOnc-GPT across two clearly defined tiers of increasing complexity: (1) a structured quality assurance (QA) tier, assessing the accurate retrieval of demographic and radiotherapy treatment plan details, followed by (2) a complex clinical outcomes labeling tier involving determination of mandibular osteoradionecrosis (ORN) in head-and-neck cancer patients and detection of cancer recurrence in independent prostate and head-and-neck cancer cohorts requiring combined interpretation of structured and unstructured patient data. The QA tier establishes foundational trust in structured-data retrieval, a critical prerequisite for successful complex clinical outcome labeling.
△ Less
Submitted 29 September, 2025;
originally announced September 2025.
-
Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs
Authors:
Shihao Qi,
Jie Ma,
Ziang Yin,
Lingling Zhang,
Jian Zhang,
Jun Liu,
Feng Tian,
Tongliang Liu
Abstract:
Existing methods usually leverage a fixed strategy, such as natural language reasoning, code-augmented reasoning, tool-integrated reasoning, or ensemble-based reasoning, to guide Large Language Models (LLMs) to perform mathematical reasoning. Our analysis reveals that the single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and effici…
▽ More
Existing methods usually leverage a fixed strategy, such as natural language reasoning, code-augmented reasoning, tool-integrated reasoning, or ensemble-based reasoning, to guide Large Language Models (LLMs) to perform mathematical reasoning. Our analysis reveals that the single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and efficiency. To address these issues, we propose Planning and Routing through Instance-Specific Modeling (PRISM), a novel framework that decouples mathematical reasoning into two stages: strategy planning and targeted execution. Specifically, we first curate a multi-strategy preference dataset, which we call MathStrat, capturing correctness, process quality, and computational efficiency for each problem--strategy pair. Then, we train a lightweight Strategy Adapter based on the dataset to obtain confidence distributions over the mentioned four reasoning strategies. At inference time, an adaptive routing policy dynamically tailors the reasoning approach based on predictor confidence. It directs the model to use single-strategy execution for high-confidence predictions, dual-strategy verification for competitive scenarios, or comprehensive multi-strategy exploration for uncertain cases. Extensive experiments across five mathematical reasoning benchmarks demonstrate that PRISM consistently outperforms individual strategies and ensemble baselines, achieving improvements ranging from 0.9% to 7.6% across different base models. The adaptive routing approach shows particularly strong benefits for mathematical reasoning tasks across diverse model architectures. Our code is released at https://github.com/reml-group/PRISM.
△ Less
Submitted 29 September, 2025;
originally announced September 2025.
-
From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision
Authors:
Jie Ma,
Shihao Qi,
Rui Xing,
Ziang Yin,
Bifan Wei,
Jun Liu,
Tongliang Liu
Abstract:
The quality of process data plays a key role in training a Process Reward Model (PRM), which can enhance the complex mathematical reasoning capability of large language models. Existing methods estimate the quality of reasoning steps based on a fixed-budget sampling strategy and navigate a vast search space to perform path expansion during the automated data generation process, resulting in their…
▽ More
The quality of process data plays a key role in training a Process Reward Model (PRM), which can enhance the complex mathematical reasoning capability of large language models. Existing methods estimate the quality of reasoning steps based on a fixed-budget sampling strategy and navigate a vast search space to perform path expansion during the automated data generation process, resulting in their inefficiency and inflexibility. To address these issues, we propose Adaptive Monte Carlo Search (AMCS), a framework that transforms data generation from fixed, static to adaptive, dynamic search at the level of node value estimation and path expansion. On one hand, AMCS adaptively refines estimation by allocating more samples to uncertain reasoning steps while using fewer samples for those that are easier to estimate. On the other hand, it enhances the path expansion through a Monte Carlo algorithm with a temporally adaptive policy that begins with broad exploration and gradually shifts toward exploiting the most promising directions. With AMCS, we construct a large-scale dataset MathSearch-200K of about 200K process supervision examples for training PRMs. To verify the effectiveness of our method, we conduct extensive experiments on four mathematical reasoning benchmarks. Experimental results show that Qwen2.5-Math-7B-PRM-AMCS achieves up to 76.2% accuracy on MATH500 with GLM-4-9B, outperforming all baseline PRMs. Notably, a 7B model supervised by Qwen2.5-Math-7B-PRM-AMCS surpasses a 72B model with weaker supervision. Moreover, Qwen2.5-Math-7B-PRM-AMCS maintains consistent advantages on out-of-distribution problems, demonstrating strong generalization capability. Our code is available at https://github.com/reml-group/AMCS.
△ Less
Submitted 29 September, 2025;
originally announced September 2025.
-
Multi-Modal Manipulation via Multi-Modal Policy Consensus
Authors:
Haonan Chen,
Jiaming Xu,
Hongyu Chen,
Kaiwen Hong,
Binghao Huang,
Chaoqi Liu,
Jiayuan Mao,
Yunzhu Li,
Yilun Du,
Katherine Driggs-Campbell
Abstract:
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes…
▽ More
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental of new representations. We evaluate our approach on simulated manipulation tasks in {RLBench}, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We further conduct perturbation-based importance analysis, which reveals adaptive shifts between modalities.
△ Less
Submitted 27 September, 2025;
originally announced September 2025.
-
Distributed Multi-Robot Multi-Target Simultaneous Search and Tracking in an Unknown Non-convex Environment
Authors:
Jun Chen,
Jiaqing Ma,
Philip Dames
Abstract:
In unknown non-convex environments, such as indoor and underground spaces, deploying a fleet of robots to explore the surroundings while simultaneously searching for and tracking targets of interest to maintain high-precision data collection represents a fundamental challenge that urgently requires resolution in applications such as environmental monitoring and rescue operations. Current research…
▽ More
In unknown non-convex environments, such as indoor and underground spaces, deploying a fleet of robots to explore the surroundings while simultaneously searching for and tracking targets of interest to maintain high-precision data collection represents a fundamental challenge that urgently requires resolution in applications such as environmental monitoring and rescue operations. Current research has made significant progress in addressing environmental exploration, information search, and target tracking problems, but has yet to establish a framework for simultaneously optimizing these tasks in complex environments. In this paper, we propose a novel motion planning algorithm framework that integrates three control strategies: a frontier-based exploration strategy, a guaranteed coverage strategy based on Lloyd's algorithm, and a sensor-based multi-target tracking strategy. By incorporating these three strategies, the proposed algorithm balances coverage search and high-precision active tracking during exploration. Our approach is validated through a series of MATLAB simulations, demonstrating validity and superiority over standard approaches.
△ Less
Submitted 27 September, 2025;
originally announced September 2025.
-
The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution
Authors:
Fei Gu,
Zi Liang,
Hongzong LI,
Jiahao MA
Abstract:
AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) enabling new paradigms such as vibe coding and agentic coding. While prior works have focused on prompt design and code generation quality, the broader impact of LLM-driven development on the iterative dynamics of software engineering remains underexplored. In this paper, we conduct large-scale exp…
▽ More
AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) enabling new paradigms such as vibe coding and agentic coding. While prior works have focused on prompt design and code generation quality, the broader impact of LLM-driven development on the iterative dynamics of software engineering remains underexplored. In this paper, we conduct large-scale experiments on thousands of algorithmic programming tasks and hundreds of framework selection tasks to systematically investigate how AI-assisted programming interacts with the software ecosystem. Our analysis reveals \textbf{a striking Matthew effect: the more popular a programming language or framework, the higher the success rate of LLM-generated code}. The phenomenon suggests that AI systems may reinforce existing popularity hierarchies, accelerating convergence around dominant tools while hindering diversity and innovation. We provide a quantitative characterization of this effect and discuss its implications for the future evolution of programming ecosystems.
△ Less
Submitted 27 September, 2025;
originally announced September 2025.
-
ABConformer: Physics-inspired Sliding Attention for Antibody-Antigen Interface Prediction
Authors:
Zhang-Yu You,
Jiahao Ma,
Hongzong Li,
Ye-Fan Hu,
Jian-Dong Huang
Abstract:
Accurate prediction of antibody-antigen (Ab-Ag) interfaces is critical for vaccine design, immunodiagnostics, and therapeutic antibody development. However, achieving reliable predictions from sequences alone remains a challenge. In this paper, we present ABCONFORMER, a model based on the Conformer backbone that captures both local and global features of a biosequence. To accurately capture Ab-Ag…
▽ More
Accurate prediction of antibody-antigen (Ab-Ag) interfaces is critical for vaccine design, immunodiagnostics, and therapeutic antibody development. However, achieving reliable predictions from sequences alone remains a challenge. In this paper, we present ABCONFORMER, a model based on the Conformer backbone that captures both local and global features of a biosequence. To accurately capture Ab-Ag interactions, we introduced the physics-inspired sliding attention, enabling residue-level contact recovery without relying on three-dimensional structural data. ABConformer can accurately predict paratopes and epitopes given the antibody and antigen sequence, and predict pan-epitopes on the antigen without antibody information. In comparison experiments, ABCONFORMER achieves state-of-the-art performance on a recent SARS-CoV-2 Ab-Ag dataset, and surpasses widely used sequence-based methods for antibody-agnostic epitope prediction. Ablation studies further quantify the contribution of each component, demonstrating that, compared to conventional cross-attention, sliding attention significantly enhances the precision of epitope prediction. To facilitate reproducibility, we will release the code under an open-source license upon acceptance.
△ Less
Submitted 27 September, 2025;
originally announced September 2025.
-
Robot Learning from Any Images
Authors:
Siheng Zhao,
Jiageng Mao,
Wei Chow,
Zeyu Shangguan,
Tianheng Shi,
Rong Xue,
Yuxi Zheng,
Yijia Weng,
Yang You,
Daniel Seita,
Leonidas Guibas,
Sergey Zakharov,
Vitor Guizilini,
Yue Wang
Abstract:
We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image so…
▽ More
We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image sources, including camera captures, robotic datasets, and Internet images. At its core, our approach combines a novel method for single-view physical scene recovery with an efficient visual blending strategy for photorealistic data collection. We demonstrate RoLA's versatility across applications like scalable robotic data generation and augmentation, robot learning from Internet images, and single-image real-to-sim-to-real systems for manipulators and humanoids. Video results are available at https://sihengz02.github.io/RoLA .
△ Less
Submitted 26 September, 2025;
originally announced September 2025.
-
An Intention-driven Lane Change Framework Considering Heterogeneous Dynamic Cooperation in Mixed-traffic Environment
Authors:
Xiaoyun Qiu,
Haichao Liu,
Yue Pan,
Jun Ma,
Xinhu Zheng
Abstract:
In mixed-traffic environments, where autonomous vehicles (AVs) interact with diverse human-driven vehicles (HVs), unpredictable intentions and heterogeneous behaviors make safe and efficient lane change maneuvers highly challenging. Existing methods often oversimplify these interactions by assuming uniform patterns. We propose an intention-driven lane change framework that integrates driving-style…
▽ More
In mixed-traffic environments, where autonomous vehicles (AVs) interact with diverse human-driven vehicles (HVs), unpredictable intentions and heterogeneous behaviors make safe and efficient lane change maneuvers highly challenging. Existing methods often oversimplify these interactions by assuming uniform patterns. We propose an intention-driven lane change framework that integrates driving-style recognition, cooperation-aware decision-making, and coordinated motion planning. A deep learning classifier trained on the NGSIM dataset identifies human driving styles in real time. A cooperation score with intrinsic and interactive components estimates surrounding drivers' intentions and quantifies their willingness to cooperate with the ego vehicle. Decision-making combines behavior cloning with inverse reinforcement learning to determine whether a lane change should be initiated. For trajectory generation, model predictive control is integrated with IRL-based intention inference to produce collision-free and socially compliant maneuvers. Experiments show that the proposed model achieves 94.2\% accuracy and 94.3\% F1-score, outperforming rule-based and learning-based baselines by 4-15\% in lane change recognition. These results highlight the benefit of modeling inter-driver heterogeneity and demonstrate the potential of the framework to advance context-aware and human-like autonomous driving in complex traffic environments.
△ Less
Submitted 26 September, 2025;
originally announced September 2025.
-
SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning
Authors:
Guoyang Zhao,
Yudong Li,
Weiqing Qi,
Kai Zhang,
Bonan Liu,
Kai Chen,
Haoang Li,
Jun Ma
Abstract:
Conventional SLAM pipelines for legged robot navigation are fragile under rapid motion, calibration demands, and sensor drift, while offering limited semantic reasoning for task-driven exploration. To deal with these issues, we propose a vision-only, SLAM-free navigation framework that replaces dense geometry with semantic reasoning and lightweight topological representations. A hierarchical visio…
▽ More
Conventional SLAM pipelines for legged robot navigation are fragile under rapid motion, calibration demands, and sensor drift, while offering limited semantic reasoning for task-driven exploration. To deal with these issues, we propose a vision-only, SLAM-free navigation framework that replaces dense geometry with semantic reasoning and lightweight topological representations. A hierarchical vision-language perception module fuses scene-level context with object-level cues for robust semantic inference. And a semantic-probabilistic topological map supports coarse-to-fine planning: LLM-based global reasoning for subgoal selection and vision-based local planning for obstacle avoidance. Integrated with reinforcement-learning locomotion controllers, the framework is deployable across diverse legged robot platforms. Experiments in simulation and real-world settings demonstrate consistent improvements in semantic accuracy, planning quality, and navigation success, while ablation studies further showcase the necessity of both hierarchical perception and fine local planning. This work introduces a new paradigm for SLAM-free, vision-language-driven navigation, shifting robotic exploration from geometry-centric mapping to semantics-driven decision making.
△ Less
Submitted 25 September, 2025;
originally announced September 2025.
-
A Versatile Foundation Model for AI-enabled Mammogram Interpretation
Authors:
Fuxiang Huang,
Jiayi Zhu,
Yunfang Yu,
Yu Xie,
Yuan Guo,
Qingcong Kong,
Mingxiang Wu,
Xinrui Jiang,
Shu Yang,
Jiabo Ma,
Ziyi Liu,
Zhe Xu,
Zhixuan Chen,
Yujie Tan,
Zifan He,
Luhui Mao,
Xi Wang,
Junlin Hou,
Lei Zhang,
Qiong Luo,
Zhenhui Li,
Herui Yao,
Hao Chen
Abstract:
Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in tra…
▽ More
Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in training data, limited model generalizability, and a lack of comprehensive evaluation across clinically relevant tasks. Here, we introduce VersaMammo, a versatile foundation model for mammograms, designed to overcome these limitations. We curated the largest multi-institutional mammogram dataset to date, comprising 706,239 images from 21 sources. To improve generalization, we propose a two-stage pre-training strategy to develop VersaMammo, a mammogram foundation model. First, a teacher model is trained via self-supervised learning to extract transferable features from unlabeled mammograms. Then, supervised learning combined with knowledge distillation transfers both features and clinical knowledge into VersaMammo. To ensure a comprehensive evaluation, we established a benchmark comprising 92 specific tasks, including 68 internal tasks and 24 external validation tasks, spanning 5 major clinical task categories: lesion detection, segmentation, classification, image retrieval, and visual question answering. VersaMammo achieves state-of-the-art performance, ranking first in 50 out of 68 specific internal tasks and 20 out of 24 external validation tasks, with average ranks of 1.5 and 1.2, respectively. These results demonstrate its superior generalization and clinical utility, offering a substantial advancement toward reliable and scalable breast cancer screening and diagnosis.
△ Less
Submitted 24 September, 2025;
originally announced September 2025.
-
MARG: MAstering Risky Gap Terrains for Legged Robots with Elevation Mapping
Authors:
Yinzhao Dong,
Ji Ma,
Liu Zhao,
Wanyue Li,
Peng Lu
Abstract:
Deep Reinforcement Learning (DRL) controllers for quadrupedal locomotion have demonstrated impressive performance on challenging terrains, allowing robots to execute complex skills such as climbing, running, and jumping. However, existing blind locomotion controllers often struggle to ensure safety and efficient traversal through risky gap terrains, which are typically highly complex, requiring ro…
▽ More
Deep Reinforcement Learning (DRL) controllers for quadrupedal locomotion have demonstrated impressive performance on challenging terrains, allowing robots to execute complex skills such as climbing, running, and jumping. However, existing blind locomotion controllers often struggle to ensure safety and efficient traversal through risky gap terrains, which are typically highly complex, requiring robots to perceive terrain information and select appropriate footholds during locomotion accurately. Meanwhile, existing perception-based controllers still present several practical limitations, including a complex multi-sensor deployment system and expensive computing resource requirements. This paper proposes a DRL controller named MAstering Risky Gap Terrains (MARG), which integrates terrain maps and proprioception to dynamically adjust the action and enhance the robot's stability in these tasks. During the training phase, our controller accelerates policy optimization by selectively incorporating privileged information (e.g., center of mass, friction coefficients) that are available in simulation but unmeasurable directly in real-world deployments due to sensor limitations. We also designed three foot-related rewards to encourage the robot to explore safe footholds. More importantly, a terrain map generation (TMG) model is proposed to reduce the drift existing in mapping and provide accurate terrain maps using only one LiDAR, providing a foundation for zero-shot transfer of the learned policy. The experimental results indicate that MARG maintains stability in various risky terrain tasks.
△ Less
Submitted 27 September, 2025; v1 submitted 24 September, 2025;
originally announced September 2025.
-
OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving
Authors:
Pei Liu,
Hongliang Lu,
Haichao Liu,
Haipeng Liu,
Xin Liu,
Ruoyu Yao,
Shengbo Eben Li,
Jun Ma
Abstract:
Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene under…
▽ More
Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
△ Less
Submitted 25 September, 2025; v1 submitted 24 September, 2025;
originally announced September 2025.
-
GUIDE: A Diffusion-Based Autonomous Robot Exploration Framework Using Global Graph Inference
Authors:
Zijun Che,
Yinghong Zhang,
Shengyi Liang,
Boyu Zhou,
Jun Ma,
Jinni Zhou
Abstract:
Autonomous exploration in structured and complex indoor environments remains a challenging task, as existing methods often struggle to appropriately model unobserved space and plan globally efficient paths. To address these limitations, we propose GUIDE, a novel exploration framework that synergistically combines global graph inference with diffusion-based decision-making. We introduce a region-ev…
▽ More
Autonomous exploration in structured and complex indoor environments remains a challenging task, as existing methods often struggle to appropriately model unobserved space and plan globally efficient paths. To address these limitations, we propose GUIDE, a novel exploration framework that synergistically combines global graph inference with diffusion-based decision-making. We introduce a region-evaluation global graph representation that integrates both observed environmental data and predictions of unexplored areas, enhanced by a region-level evaluation mechanism to prioritize reliable structural inferences while discounting uncertain predictions. Building upon this enriched representation, a diffusion policy network generates stable, foresighted action sequences with significantly reduced denoising steps. Extensive simulations and real-world deployments demonstrate that GUIDE consistently outperforms state-of-the-art methods, achieving up to 18.3% faster coverage completion and a 34.9% reduction in redundant movements.
△ Less
Submitted 25 September, 2025; v1 submitted 24 September, 2025;
originally announced September 2025.
-
SAGE:State-Aware Guided End-to-End Policy for Multi-Stage Sequential Tasks via Hidden Markov Decision Process
Authors:
BinXu Wu,
TengFei Zhang,
Chen Yang,
JiaHao Wen,
HaoCheng Li,
JingTian Ma,
Zhen Chen,
JingYuan Wang
Abstract:
Multi-stage sequential (MSS) robotic manipulation tasks are prevalent and crucial in robotics. They often involve state ambiguity, where visually similar observations correspond to different actions. We present SAGE, a state-aware guided imitation learning framework that models tasks as a Hidden Markov Decision Process (HMDP) to explicitly capture latent task stages and resolve ambiguity. We insta…
▽ More
Multi-stage sequential (MSS) robotic manipulation tasks are prevalent and crucial in robotics. They often involve state ambiguity, where visually similar observations correspond to different actions. We present SAGE, a state-aware guided imitation learning framework that models tasks as a Hidden Markov Decision Process (HMDP) to explicitly capture latent task stages and resolve ambiguity. We instantiate the HMDP with a state transition network that infers hidden states, and a state-aware action policy that conditions on both observations and hidden states to produce actions, thereby enabling disambiguation across task stages. To reduce manual annotation effort, we propose a semi-automatic labeling pipeline combining active learning and soft label interpolation. In real-world experiments across multiple complex MSS tasks with state ambiguity, SAGE achieved 100% task success under the standard evaluation protocol, markedly surpassing the baselines. Ablation studies further show that such performance can be maintained with manual labeling for only about 13% of the states, indicating its strong effectiveness.
△ Less
Submitted 24 September, 2025;
originally announced September 2025.
-
HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames
Authors:
Alessandro Saviolo,
Jeffrey Mao,
Giuseppe Loianno
Abstract:
Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but…
▽ More
Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.
△ Less
Submitted 28 September, 2025; v1 submitted 23 September, 2025;
originally announced September 2025.
-
Otters: An Energy-Efficient SpikingTransformer via Optical Time-to-First-Spike Encoding
Authors:
Zhanglu Yan,
Jiayi Mao,
Qianhui Liu,
Fanfan Li,
Gang Pan,
Tao Luo,
Bowen Zhu,
Weng-Fai Wong
Abstract:
Spiking neural networks (SNNs) promise high energy efficiency, particularly with time-to-first-spike (TTFS) encoding, which maximizes sparsity by emitting at most one spike per neuron. However, such energy advantage is often unrealized because inference requires evaluating a temporal decay function and subsequent multiplication with the synaptic weights. This paper challenges this costly approach…
▽ More
Spiking neural networks (SNNs) promise high energy efficiency, particularly with time-to-first-spike (TTFS) encoding, which maximizes sparsity by emitting at most one spike per neuron. However, such energy advantage is often unrealized because inference requires evaluating a temporal decay function and subsequent multiplication with the synaptic weights. This paper challenges this costly approach by repurposing a physical hardware `bug', namely, the natural signal decay in optoelectronic devices, as the core computation of TTFS. We fabricated a custom indium oxide optoelectronic synapse, showing how its natural physical decay directly implements the required temporal function. By treating the device's analog output as the fused product of the synaptic weight and temporal decay, optoelectronic synaptic TTFS (named Otters) eliminates these expensive digital operations. To use the Otters paradigm in complex architectures like the transformer, which are challenging to train directly due to the sparsity issue, we introduce a novel quantized neural network-to-SNN conversion algorithm. This complete hardware-software co-design enables our model to achieve state-of-the-art accuracy across seven GLUE benchmark datasets and demonstrates a 1.77$\times$ improvement in energy efficiency over previous leading SNNs, based on a comprehensive analysis of compute, data movement, and memory access costs using energy measurements from a commercial 22nm process. Our work thus establishes a new paradigm for energy-efficient SNNs, translating fundamental device physics directly into powerful computational primitives. All codes and data are open source.
△ Less
Submitted 23 September, 2025;
originally announced September 2025.
-
One-shot Embroidery Customization via Contrastive LoRA Modulation
Authors:
Jun Ma,
Qian He,
Gaofeng He,
Huang Chen,
Chen Liu,
Xiaogang Jin,
Huamin Wang
Abstract:
Diffusion models have significantly advanced image manipulation techniques, and their ability to generate photorealistic images is beginning to transform retail workflows, particularly in presale visualization. Beyond artistic style transfer, the capability to perform fine-grained visual feature transfer is becoming increasingly important. Embroidery is a textile art form characterized by intricat…
▽ More
Diffusion models have significantly advanced image manipulation techniques, and their ability to generate photorealistic images is beginning to transform retail workflows, particularly in presale visualization. Beyond artistic style transfer, the capability to perform fine-grained visual feature transfer is becoming increasingly important. Embroidery is a textile art form characterized by intricate interplay of diverse stitch patterns and material properties, which poses unique challenges for existing style transfer methods. To explore the customization for such fine-grained features, we propose a novel contrastive learning framework that disentangles fine-grained style and content features with a single reference image, building on the classic concept of image analogy. We first construct an image pair to define the target style, and then adopt a similarity metric based on the decoupled representations of pretrained diffusion models for style-content separation. Subsequently, we propose a two-stage contrastive LoRA modulation technique to capture fine-grained style features. In the first stage, we iteratively update the whole LoRA and the selected style blocks to initially separate style from content. In the second stage, we design a contrastive learning strategy to further decouple style and content through self-knowledge distillation. Finally, we build an inference pipeline to handle image or text inputs with only the style blocks. To evaluate our method on fine-grained style transfer, we build a benchmark for embroidery customization. Our approach surpasses prior methods on this task and further demonstrates strong generalization to three additional domains: artistic style transfer, sketch colorization, and appearance transfer.
△ Less
Submitted 23 September, 2025;
originally announced September 2025.
-
FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
Authors:
Yuxuan Cai,
Xiaozhuan Liang,
Xinghua Wang,
Jin Ma,
Haijin Liang,
Jinwen Luo,
Xinyu Zuo,
Lisheng Duan,
Yuyang Yin,
Xi Chen
Abstract:
As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This pap…
▽ More
As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This paper introduces FastMTP, a simple yet effective method that improves multi-step draft quality by aligning MTP training with its inference pattern, significantly enhancing speculative decoding performance. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens and maintain high acceptance rates across multiple recursive draft steps. By integrating language-aware dynamic vocabulary compression into the MTP head, we further reduce computational overhead in the drafting process. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction with lossless output quality, outperforming vanilla MTP by 82%. FastMTP requires only lightweight training and seamlessly integrates with existing inference frameworks, offering a practical and rapidly deployable solution for accelerating LLM inference.
△ Less
Submitted 16 September, 2025;
originally announced September 2025.
-
Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction
Authors:
Yi Gu,
Kuniaki Saito,
Jiaxin Ma
Abstract:
As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modalities dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. O…
▽ More
As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modalities dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens for improving missingness-aware fusion of modalities and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks, encompassing both visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in challenging and practical scenarios where only a single modality is available. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our approach for multimodal learning, offering a scalable, low-cost solution with significant potential for real-world clinical applications. The code is available at https://github.com/omron-sinicx/medical-modality-dropout.
△ Less
Submitted 22 September, 2025;
originally announced September 2025.
-
Qianfan-VL: Domain-Enhanced Universal Vision-Language Models
Authors:
Daxiang Dong,
Mingming Zheng,
Dong Xu,
Bairong Zhuang,
Wenyu Zhang,
Chunhua Luo,
Haoran Wang,
Zijian Zhao,
Jie Li,
Yuxuan Li,
Hanjun Zhong,
Mengyue Liu,
Jieting Chen,
Shupeng Li,
Lun Tian,
Yaping Feng,
Xin Li,
Donggang Jiang,
Yong Chen,
Yehua Xu,
Duohao Qin,
Chen Feng,
Dan Wang,
Henghua Zhang,
Jingjing Ha
, et al. (10 additional authors not shown)
Abstract:
We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong g…
▽ More
We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong general performance. Qianfan-VL achieves comparable results to leading open-source models on general benchmarks, with state-of-the-art performance on benchmarks such as CCBench, SEEDBench IMG, ScienceQA, and MMStar. The domain enhancement strategy delivers significant advantages in OCR and document understanding, validated on both public benchmarks (OCRBench 873, DocVQA 94.75%) and in-house evaluations. Notably, Qianfan-VL-8B and 70B variants incorporate long chain-of-thought capabilities, demonstrating superior performance on mathematical reasoning (MathVista 78.6%) and logical inference tasks. All models are trained entirely on Baidu's Kunlun P800 chips, validating the capability of large-scale AI infrastructure to train SOTA-level multimodal models with over 90% scaling efficiency on 5000 chips for a single task. This work establishes an effective methodology for developing domain-enhanced multimodal models suitable for diverse enterprise deployment scenarios.
△ Less
Submitted 19 September, 2025;
originally announced September 2025.
-
SocialTraj: Two-Stage Socially-Aware Trajectory Prediction for Autonomous Driving via Conditional Diffusion Model
Authors:
Xiao Zhou,
Zengqi Peng,
Jun Ma
Abstract:
Accurate trajectory prediction of surrounding vehicles (SVs) is crucial for autonomous driving systems to avoid misguided decisions and potential accidents. However, achieving reliable predictions in highly dynamic and complex traffic scenarios remains a significant challenge. One of the key impediments lies in the limited effectiveness of current approaches to capture the multi-modal behaviors of…
▽ More
Accurate trajectory prediction of surrounding vehicles (SVs) is crucial for autonomous driving systems to avoid misguided decisions and potential accidents. However, achieving reliable predictions in highly dynamic and complex traffic scenarios remains a significant challenge. One of the key impediments lies in the limited effectiveness of current approaches to capture the multi-modal behaviors of drivers, which leads to predicted trajectories that deviate from actual future motions. To address this issue, we propose SocialTraj, a novel trajectory prediction framework integrating social psychology principles through social value orientation (SVO). By utilizing Bayesian inverse reinforcement learning (IRL) to estimate the SVO of SVs, we obtain the critical social context to infer the future interaction trend. To ensure modal consistency in predicted behaviors, the estimated SVOs of SVs are embedded into a conditional denoising diffusion model that aligns generated trajectories with historical driving styles. Additionally, the planned future trajectory of the ego vehicle (EV) is explicitly incorporated to enhance interaction modeling. Extensive experiments on NGSIM and HighD datasets demonstrate that SocialTraj is capable of adapting to highly dynamic and interactive scenarios while generating socially compliant and behaviorally consistent trajectory predictions, outperforming existing baselines. Ablation studies demonstrate that dynamic SVO estimation and explicit ego-planning components notably improve prediction accuracy and substantially reduce inference time.
△ Less
Submitted 22 September, 2025;
originally announced September 2025.
-
SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
Authors:
Dian Jin,
Yanghao Zhou,
Jinxing Zhou,
Jiaqi Ma,
Ruohao Guo,
Dan Guo
Abstract:
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Se…
▽ More
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objectsacross video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings from different expressions but referring to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods.
△ Less
Submitted 23 September, 2025; v1 submitted 22 September, 2025;
originally announced September 2025.
-
Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
Authors:
Zhitao Zeng,
Guojian Yuan,
Junyuan Mao,
Yuxuan Wang,
Xiaoshuang Jia,
Yueming Jin
Abstract:
Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimension…
▽ More
Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimensions: the temporal scale, forecasting states of humans and surgery at varying look-ahead intervals, and the state scale, modeling a hierarchy of states in general and surgical scenes. For example, in general scenes, states of contact relationships are finer-grained than states of spatial relationships. In surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a method, Incremental Generation and Multi-agent Collaboration (IG-MC), which integrates two key innovations. First, we present a plug-and-play incremental generation module that continuously synthesizes up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, we present a decision-driven multi-agent collaboration framework for multi-state prediction, comprising generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity.
△ Less
Submitted 23 September, 2025; v1 submitted 22 September, 2025;
originally announced September 2025.
-
Computational Scaffolding of Composition, Value, and Color for Disciplined Drawing
Authors:
Jiaju Ma,
Chau Vu,
Asya Lyubavina,
Catherine Liu,
Jingyi Li
Abstract:
One way illustrators engage in disciplined drawing - the process of drawing to improve technical skills - is through studying and replicating reference images. However, for many novice and intermediate digital artists, knowing how to approach studying a reference image can be challenging. It can also be difficult to receive immediate feedback on their works-in-progress. To help these users develop…
▽ More
One way illustrators engage in disciplined drawing - the process of drawing to improve technical skills - is through studying and replicating reference images. However, for many novice and intermediate digital artists, knowing how to approach studying a reference image can be challenging. It can also be difficult to receive immediate feedback on their works-in-progress. To help these users develop their professional vision, we propose ArtKrit, a tool that scaffolds the process of replicating a reference image into three main steps: composition, value, and color. At each step, our tool offers computational guidance, such as adaptive composition line generation, and automatic feedback, such as value and color accuracy. Evaluating this tool with intermediate digital artists revealed that ArtKrit could flexibly accommodate their unique workflows. Our code and supplemental materials are available at https://majiaju.io/artkrit .
△ Less
Submitted 21 September, 2025;
originally announced September 2025.
-
CoPlanner: An Interactive Motion Planner with Contingency-Aware Diffusion for Autonomous Driving
Authors:
Ruiguo Zhong,
Ruoyu Yao,
Pei Liu,
Xiaolong Chen,
Rui Yang,
Jun Ma
Abstract:
Accurate trajectory prediction and motion planning are crucial for autonomous driving systems to navigate safely in complex, interactive environments characterized by multimodal uncertainties. However, current generation-then-evaluation frameworks typically construct multiple plausible trajectory hypotheses but ultimately adopt a single most likely outcome, leading to overconfident decisions and a…
▽ More
Accurate trajectory prediction and motion planning are crucial for autonomous driving systems to navigate safely in complex, interactive environments characterized by multimodal uncertainties. However, current generation-then-evaluation frameworks typically construct multiple plausible trajectory hypotheses but ultimately adopt a single most likely outcome, leading to overconfident decisions and a lack of fallback strategies that are vital for safety in rare but critical scenarios. Moreover, the usual decoupling of prediction and planning modules could result in socially inconsistent or unrealistic joint trajectories, especially in highly interactive traffic. To address these challenges, we propose a contingency-aware diffusion planner (CoPlanner), a unified framework that jointly models multi-agent interactive trajectory generation and contingency-aware motion planning. Specifically, the pivot-conditioned diffusion mechanism anchors trajectory sampling on a validated, shared short-term segment to preserve temporal consistency, while stochastically generating diverse long-horizon branches that capture multimodal motion evolutions. In parallel, we design a contingency-aware multi-scenario scoring strategy that evaluates candidate ego trajectories across multiple plausible long-horizon evolution scenarios, balancing safety, progress, and comfort. This integrated design preserves feasible fallback options and enhances robustness under uncertainty, leading to more realistic interaction-aware planning. Extensive closed-loop experiments on the nuPlan benchmark demonstrate that CoPlanner consistently surpasses state-of-the-art methods on both Val14 and Test14 datasets, achieving significant improvements in safety and comfort under both reactive and non-reactive settings. Code and model will be made publicly available upon acceptance.
△ Less
Submitted 21 September, 2025;
originally announced September 2025.
-
Orchestrate, Generate, Reflect: A VLM-Based Multi-Agent Collaboration Framework for Automated Driving Policy Learning
Authors:
Zengqi Peng,
Yusen Xie,
Yubin Wang,
Rui Yang,
Qifeng Chen,
Jun Ma
Abstract:
The advancement of foundation models fosters new initiatives for policy learning in achieving safe and efficient autonomous driving. However, a critical bottleneck lies in the manual engineering of reward functions and training curricula for complex and dynamic driving tasks, which is a labor-intensive and time-consuming process. To address this problem, we propose OGR (Orchestrate, Generate, Refl…
▽ More
The advancement of foundation models fosters new initiatives for policy learning in achieving safe and efficient autonomous driving. However, a critical bottleneck lies in the manual engineering of reward functions and training curricula for complex and dynamic driving tasks, which is a labor-intensive and time-consuming process. To address this problem, we propose OGR (Orchestrate, Generate, Reflect), a novel automated driving policy learning framework that leverages vision-language model (VLM)-based multi-agent collaboration. Our framework capitalizes on advanced reasoning and multimodal understanding capabilities of VLMs to construct a hierarchical agent system. Specifically, a centralized orchestrator plans high-level training objectives, while a generation module employs a two-step analyze-then-generate process for efficient generation of reward-curriculum pairs. A reflection module then facilitates iterative optimization based on the online evaluation. Furthermore, a dedicated memory module endows the VLM agents with the capabilities of long-term memory. To enhance robustness and diversity of the generation process, we introduce a parallel generation scheme and a human-in-the-loop technique for augmentation of the reward observation space. Through efficient multi-agent cooperation and leveraging rich multimodal information, OGR enables the online evolution of reinforcement learning policies to acquire interaction-aware driving skills. Extensive experiments in the CARLA simulator demonstrate the superior performance, robust generalizability across distinct urban scenarios, and strong compatibility with various RL algorithms. Further real-world experiments highlight the practical viability and effectiveness of our framework. The source code will be available upon acceptance of the paper.
△ Less
Submitted 21 September, 2025;
originally announced September 2025.
-
DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching
Authors:
Meng Yang,
Fan Fan,
Zizhuo Li,
Songchu Deng,
Yong Ma,
Jiayi Ma
Abstract:
Multimodal image matching seeks pixel-level correspondences between images of different modalities, crucial for cross-modal perception, fusion and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform po…
▽ More
Multimodal image matching seeks pixel-level correspondences between images of different modalities, crucial for cross-modal perception, fusion and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. Vision Foundation Model (VFM), trained on large-scale data, yields generalizable and robust feature representations adapted to data and tasks of various modalities, including multimodal matching. Thus, we propose DistillMatch, a multimodal image matching method using knowledge distillation from VFM. DistillMatch employs knowledge distillation to build a lightweight student model that extracts high-level semantic features from VFM (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts and injects modality category information into the other modality's features, which enhances the model's understanding of cross-modal correlations. Furthermore, we design V2I-GAN to boost the model's generalization by translating visible to pseudo-infrared images for data augmentation. Experiments show that DistillMatch outperforms existing algorithms on public datasets.
△ Less
Submitted 19 September, 2025;
originally announced September 2025.
-
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
Authors:
Xinlei Niu,
Jianbo Ma,
Dylan Harper-Harris,
Xiangyu Zhang,
Charles Patrick Martin,
Jing Zhang
Abstract:
The generation of realistic, context-aware audio is important in real-world applications such as video game development. While existing video-to-audio (V2A) methods mainly focus on Foley sound generation, they struggle to produce intelligible speech. Meanwhile, current environmental speech synthesis approaches remain text-driven and fail to temporally align with dynamic video content. In this pape…
▽ More
The generation of realistic, context-aware audio is important in real-world applications such as video game development. While existing video-to-audio (V2A) methods mainly focus on Foley sound generation, they struggle to produce intelligible speech. Meanwhile, current environmental speech synthesis approaches remain text-driven and fail to temporally align with dynamic video content. In this paper, we propose Beyond Video-to-SFX (BVS), a method to generate synchronized audio with environmentally aware intelligible speech for given videos. We introduce a two-stage modeling method: (1) stage one is a video-guided audio semantic (V2AS) model to predict unified audio semantic tokens conditioned on phonetic cues; (2) stage two is a video-conditioned semantic-to-acoustic (VS2A) model that refines semantic tokens into detailed acoustic tokens. Experiments demonstrate the effectiveness of BVS in scenarios such as video-to-context-aware speech synthesis and immersive audio background conversion, with ablation studies further validating our design. Our demonstration is available at~\href{https://xinleiniu.github.io/BVS-demo/}{BVS-Demo}.
△ Less
Submitted 18 September, 2025;
originally announced September 2025.
-
Semantic-LiDAR-Inertial-Wheel Odometry Fusion for Robust Localization in Large-Scale Dynamic Environments
Authors:
Haoxuan Jiang,
Peicong Qian,
Yusen Xie,
Linwei Zheng,
Xiaocong Li,
Ming Liu,
Jun Ma
Abstract:
Reliable, drift-free global localization presents significant challenges yet remains crucial for autonomous navigation in large-scale dynamic environments. In this paper, we introduce a tightly-coupled Semantic-LiDAR-Inertial-Wheel Odometry fusion framework, which is specifically designed to provide high-precision state estimation and robust localization in large-scale dynamic environments. Our fr…
▽ More
Reliable, drift-free global localization presents significant challenges yet remains crucial for autonomous navigation in large-scale dynamic environments. In this paper, we introduce a tightly-coupled Semantic-LiDAR-Inertial-Wheel Odometry fusion framework, which is specifically designed to provide high-precision state estimation and robust localization in large-scale dynamic environments. Our framework leverages an efficient semantic-voxel map representation and employs an improved scan matching algorithm, which utilizes global semantic information to significantly reduce long-term trajectory drift. Furthermore, it seamlessly fuses data from LiDAR, IMU, and wheel odometry using a tightly-coupled multi-sensor fusion Iterative Error-State Kalman Filter (iESKF). This ensures reliable localization without experiencing abnormal drift. Moreover, to tackle the challenges posed by terrain variations and dynamic movements, we introduce a 3D adaptive scaling strategy that allows for flexible adjustments to wheel odometry measurement weights, thereby enhancing localization precision. This study presents extensive real-world experiments conducted in a one-million-square-meter automated port, encompassing 3,575 hours of operational data from 35 Intelligent Guided Vehicles (IGVs). The results consistently demonstrate that our system outperforms state-of-the-art LiDAR-based localization methods in large-scale dynamic environments, highlighting the framework's reliability and practical value.
△ Less
Submitted 18 September, 2025;
originally announced September 2025.
-
Artificial Intelligence-derived Cardiotocography Age as a Digital Biomarker for Predicting Future Adverse Pregnancy Outcomes
Authors:
Jinshuai Gu,
Zenghui Lin,
Jingying Ma,
Jingyu Wang,
Linyan Zhang,
Rui Bai,
Zelin Tu,
Youyou Jiang,
Donglin Xie,
Yuxi Zhou,
Guoli Liu,
Shenda Hong
Abstract:
Cardiotocography (CTG) is a low-cost, non-invasive fetal health assessment technique used globally, especially in underdeveloped countries. However, it is currently mainly used to identify the fetus's current status (e.g., fetal acidosis or hypoxia), and the potential of CTG in predicting future adverse pregnancy outcomes has not been fully explored. We aim to develop an AI-based model that predic…
▽ More
Cardiotocography (CTG) is a low-cost, non-invasive fetal health assessment technique used globally, especially in underdeveloped countries. However, it is currently mainly used to identify the fetus's current status (e.g., fetal acidosis or hypoxia), and the potential of CTG in predicting future adverse pregnancy outcomes has not been fully explored. We aim to develop an AI-based model that predicts biological age from CTG time series (named CTGage), then calculate the age gap between CTGage and actual age (named CTGage-gap), and use this gap as a new digital biomarker for future adverse pregnancy outcomes. The CTGage model is developed using 61,140 records from 11,385 pregnant women, collected at Peking University People's Hospital between 2018 and 2022. For model training, a structurally designed 1D convolutional neural network is used, incorporating distribution-aligned augmented regression technology. The CTGage-gap is categorized into five groups: < -21 days (underestimation group), -21 to -7 days, -7 to 7 days (normal group), 7 to 21 days, and > 21 days (overestimation group). We further defined the underestimation group and overestimation group together as the high-risk group. We then compare the incidence of adverse outcomes and maternal diseases across these groups. The average absolute error of the CTGage model is 10.91 days. When comparing the overestimation group with the normal group, premature infants incidence is 5.33% vs. 1.42% (p < 0.05) and gestational diabetes mellitus (GDM) incidence is 31.93% vs. 20.86% (p < 0.05). When comparing the underestimation group with the normal group, low birth weight incidence is 0.17% vs. 0.15% (p < 0.05) and anaemia incidence is 37.51% vs. 34.74% (p < 0.05). Artificial intelligence-derived CTGage can predict the future risk of adverse pregnancy outcomes and hold potential as a novel, non-invasive, and easily accessible digital biomarker.
△ Less
Submitted 3 September, 2025;
originally announced September 2025.
-
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Authors:
Peng Xu,
Shengwu Xiong,
Jiajun Zhang,
Yaxiong Chen,
Bowen Zhou,
Chen Change Loy,
David A. Clifton,
Kyoung Mu Lee,
Luc Van Gool,
Ruiming He,
Ruilin Yao,
Xinwei Long,
Jirui Huang,
Kai Tian,
Sa Yang,
Yihua Shao,
Jin Feng,
Yue Zhong,
Jiakai Zhou,
Cheng Tang,
Tianyu Zou,
Yifang Zhang,
Junming Liang,
Guoyou Li,
Zhaoxiang Wang
, et al. (103 additional authors not shown)
Abstract:
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's…
▽ More
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets Lens and AdsQA as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from the renowned academic and industrial institutions have registered and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
Generative AI for Misalignment-Resistant Virtual Staining to Accelerate Histopathology Workflows
Authors:
Jiabo MA,
Wenqiang Li,
Jinbang Li,
Ziyi Liu,
Linshan Wu,
Fengtao Zhou,
Li Liang,
Ronald Cheong Kin Chan,
Terence T. W. Wong,
Hao Chen
Abstract:
Accurate histopathological diagnosis often requires multiple differently stained tissue sections, a process that is time-consuming, labor-intensive, and environmentally taxing due to the use of multiple chemical stains. Recently, virtual staining has emerged as a promising alternative that is faster, tissue-conserving, and environmentally friendly. However, existing virtual staining methods face s…
▽ More
Accurate histopathological diagnosis often requires multiple differently stained tissue sections, a process that is time-consuming, labor-intensive, and environmentally taxing due to the use of multiple chemical stains. Recently, virtual staining has emerged as a promising alternative that is faster, tissue-conserving, and environmentally friendly. However, existing virtual staining methods face significant challenges in clinical applications, primarily due to their reliance on well-aligned paired data. Obtaining such data is inherently difficult because chemical staining processes can distort tissue structures, and a single tissue section cannot undergo multiple staining procedures without damage or loss of information. As a result, most available virtual staining datasets are either unpaired or roughly paired, making it difficult for existing methods to achieve accurate pixel-level supervision. To address this challenge, we propose a robust virtual staining framework featuring cascaded registration mechanisms to resolve spatial mismatches between generated outputs and their corresponding ground truth. Experimental results demonstrate that our method significantly outperforms state-of-the-art models across five datasets, achieving an average improvement of 3.2% on internal datasets and 10.1% on external datasets. Moreover, in datasets with substantial misalignment, our approach achieves a remarkable 23.8% improvement in peak signal-to-noise ratio compared to baseline models. The exceptional robustness of the proposed method across diverse datasets simplifies the data acquisition process for virtual staining and offers new insights for advancing its development.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
SEG-Parking: Towards Safe, Efficient, and Generalizable Autonomous Parking via End-to-End Offline Reinforcement Learning
Authors:
Zewei Yang,
Zengqi Peng,
Jun Ma
Abstract:
Autonomous parking is a critical component for achieving safe and efficient urban autonomous driving. However, unstructured environments and dynamic interactions pose significant challenges to autonomous parking tasks. To address this problem, we propose SEG-Parking, a novel end-to-end offline reinforcement learning (RL) framework to achieve interaction-aware autonomous parking. Notably, a special…
▽ More
Autonomous parking is a critical component for achieving safe and efficient urban autonomous driving. However, unstructured environments and dynamic interactions pose significant challenges to autonomous parking tasks. To address this problem, we propose SEG-Parking, a novel end-to-end offline reinforcement learning (RL) framework to achieve interaction-aware autonomous parking. Notably, a specialized parking dataset is constructed for parking scenarios, which include those without interference from the opposite vehicle (OV) and complex ones involving interactions with the OV. Based on this dataset, a goal-conditioned state encoder is pretrained to map the fused perception information into the latent space. Then, an offline RL policy is optimized with a conservative regularizer that penalizes out-of-distribution actions. Extensive closed-loop experiments are conducted in the high-fidelity CARLA simulator. Comparative results demonstrate the superior performance of our framework with the highest success rate and robust generalization to out-of-distribution parking scenarios. The related dataset and source code will be made publicly available after the paper is accepted.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
Asymptotic Analysis of Nonlinear One-Bit Precoding in Massive MIMO Systems via Approximate Message Passing
Authors:
Zheyu Wu,
Junjie Ma,
Ya-Feng Liu,
Bruno Clerckx
Abstract:
Massive multiple-input multiple-output (MIMO) systems employing one-bit digital-to-analog converters offer a hardware-efficient solution for wireless communications. However, the one-bit constraint poses significant challenges for precoding design, as it transforms the problem into a discrete and nonconvex optimization task. In this paper, we investigate a widely adopted ``convex-relaxation-then-q…
▽ More
Massive multiple-input multiple-output (MIMO) systems employing one-bit digital-to-analog converters offer a hardware-efficient solution for wireless communications. However, the one-bit constraint poses significant challenges for precoding design, as it transforms the problem into a discrete and nonconvex optimization task. In this paper, we investigate a widely adopted ``convex-relaxation-then-quantization" approach for nonlinear symbol-level one-bit precoding. Specifically, we first solve a convex relaxation of the discrete minimum mean square error precoding problem, and then quantize the solution to satisfy the one-bit constraint. To analyze the high-dimensional asymptotic performance of this scheme, we develop a novel analytical framework based on approximate message passing (AMP). This framework enables us to derive a closed-form expression for the symbol error probability (SEP) at the receiver side in the large-system limit, which provides a quantitative characterization of how model and system parameters affect the SEP performance. Our empirical results suggest that the $\ell_\infty^2$ regularizer, when paired with an optimally chosen regularization parameter, achieves optimal SEP performance within a broad class of convex regularization functions. As a first step towards a theoretical justification, we prove the optimality of the $\ell_\infty^2$ regularizer within the mixed $\ell_\infty^2$-$\ell_2^2$ regularization functions.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
Authors:
Qikai Chang,
Zhenrong Zhang,
Pengfei Hu,
Jiefeng Ma,
Yicheng Pan,
Jianshu Zhang,
Jun Du,
Quan Liu,
Jianqing Gao
Abstract:
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reason…
▽ More
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
Embracing Bulky Objects with Humanoid Robots: Whole-Body Manipulation with Reinforcement Learning
Authors:
Chunxin Zheng,
Kai Chen,
Zhihai Bi,
Yulin Li,
Liang Pan,
Jinni Zhou,
Haoang Li,
Jun Ma
Abstract:
Whole-body manipulation (WBM) for humanoid robots presents a promising approach for executing embracing tasks involving bulky objects, where traditional grasping relying on end-effectors only remains limited in such scenarios due to inherent stability and payload constraints. This paper introduces a reinforcement learning framework that integrates a pre-trained human motion prior with a neural sig…
▽ More
Whole-body manipulation (WBM) for humanoid robots presents a promising approach for executing embracing tasks involving bulky objects, where traditional grasping relying on end-effectors only remains limited in such scenarios due to inherent stability and payload constraints. This paper introduces a reinforcement learning framework that integrates a pre-trained human motion prior with a neural signed distance field (NSDF) representation to achieve robust whole-body embracing. Our method leverages a teacher-student architecture to distill large-scale human motion data, generating kinematically natural and physically feasible whole-body motion patterns. This facilitates coordinated control across the arms and torso, enabling stable multi-contact interactions that enhance the robustness in manipulation and also the load capacity. The embedded NSDF further provides accurate and continuous geometric perception, improving contact awareness throughout long-horizon tasks. We thoroughly evaluate the approach through comprehensive simulations and real-world experiments. The results demonstrate improved adaptability to diverse shapes and sizes of objects and also successful sim-to-real transfer. These indicate that the proposed framework offers an effective and practical solution for multi-contact and long-horizon WBM tasks of humanoid robots.
△ Less
Submitted 16 September, 2025;
originally announced September 2025.
-
A Scalable Architecture for Efficient Multi-bit Fully Homomorphic Encryption
Authors:
Jiaao Ma,
Ceyu Xu,
Lisa Wu Wills
Abstract:
In the era of cloud computing, privacy-preserving computation offloading is crucial for safeguarding sensitive data. Fully Homomorphic Encryption (FHE) enables secure processing of encrypted data, but the inherent computational complexity of FHE operations introduces significant computational overhead on the server side. FHE schemes often face a tradeoff between efficiency and versatility. While t…
▽ More
In the era of cloud computing, privacy-preserving computation offloading is crucial for safeguarding sensitive data. Fully Homomorphic Encryption (FHE) enables secure processing of encrypted data, but the inherent computational complexity of FHE operations introduces significant computational overhead on the server side. FHE schemes often face a tradeoff between efficiency and versatility. While the CKKS scheme is highly efficient for polynomial operations, it lacks the flexibility of the binary TFHE (Torus-FHE) scheme, which offers greater versatility but at the cost of efficiency. The recent multi-bit TFHE extension offers greater flexibility and performance by supporting native non-polynomial operations and efficient integer processing. However, current implementations of multi-bit TFHE are constrained by its narrower numeric representation, which prevents its adoption in applications requiring wider numeric representations.
To address this challenge, we introduce Taurus, a hardware accelerator designed to enhance the efficiency of multi-bit TFHE computations. Taurus supports ciphertexts up to 10 bits by leveraging novel FFT units and optimizing memory bandwidth through key reuse strategies. We also propose a compiler with operation deduplication to improve memory utilization. Our experiment results demonstrate that Taurus achieves up to 2600x speedup over a CPU, 1200x speedup over a GPU, and up to 7x faster compared to the previous state-of-the-art TFHE accelerator. Moreover, Taurus is the first accelerator to demonstrate privacy-preserving inference with large language models such as GPT-2. These advancements enable more practical and scalable applications of privacy-preserving computation in cloud environments.
△ Less
Submitted 16 September, 2025;
originally announced September 2025.
-
High-Energy Concentration for Federated Learning in Frequency Domain
Authors:
Haozhi Shi,
Weiying Xie,
Hangyu Ye,
Daixun Li,
Jitao Ma,
Leyuan Fang
Abstract:
Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. Since synthetic data is sent to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain de…
▽ More
Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. Since synthetic data is sent to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain designs, which inevitably increases the communication burden. In this paper, we propose a novel Frequency-Domain aware FL method with high-energy concentration (FedFD) to address this problem. Our FedFD is inspired by the discovery that the discrete cosine transform predominantly distributes energy to specific regions, referred to as high-energy concentration. The principle behind FedFD is that low-energy like high-frequency components usually contain redundant information and noise, thus filtering them helps reduce communication costs and optimize performance. Our FedFD is mathematically formulated to preserve the low-frequency components using a binary mask, facilitating an optimal solution through frequency-domain distribution alignment. In particular, real data-driven synthetic classification is imposed into the loss to enhance the quality of the low-frequency components. On five image and speech datasets, FedFD achieves superior performance than state-of-the-art methods while reducing communication costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $α= 0.01$, FedFD achieves a minimum reduction of 37.78\% in the communication cost, while attaining a 10.88\% performance gain.
△ Less
Submitted 15 September, 2025;
originally announced September 2025.
-
A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction
Authors:
Huajun Zhou,
Fengtao Zhou,
Jiabo Ma,
Yingxue Xu,
Xi Wang,
Xiuming Zhang,
Li Liang,
Zhenhui Li,
Hao Chen
Abstract:
Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates patho…
▽ More
Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE's generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, demonstrating substantial improvements in C-index ranging from 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts, respectively. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize tailored therapies and improve treatment outcomes.
△ Less
Submitted 15 September, 2025;
originally announced September 2025.
-
Exploring Training Data Attribution under Limited Access Constraints
Authors:
Shiyuan Zhang,
Junwei Deng,
Juhan Bae,
Jiaqi Ma
Abstract:
Training data attribution (TDA) plays a critical role in understanding the influence of individual training data points on model predictions. Gradient-based TDA methods, popularized by \textit{influence function} for their superior performance, have been widely applied in data selection, data cleaning, data economics, and fact tracing. However, in real-world scenarios where commercial models are n…
▽ More
Training data attribution (TDA) plays a critical role in understanding the influence of individual training data points on model predictions. Gradient-based TDA methods, popularized by \textit{influence function} for their superior performance, have been widely applied in data selection, data cleaning, data economics, and fact tracing. However, in real-world scenarios where commercial models are not publicly accessible and computational resources are limited, existing TDA methods are often constrained by their reliance on full model access and high computational costs. This poses significant challenges to the broader adoption of TDA in practical applications.
In this work, we present a systematic study of TDA methods under various access and resource constraints. We investigate the feasibility of performing TDA under varying levels of access constraints by leveraging appropriately designed solutions such as proxy models. Besides, we demonstrate that attribution scores obtained from models without prior training on the target dataset remain informative across a range of tasks, which is useful for scenarios where computational resources are limited. Our findings provide practical guidance for deploying TDA in real-world environments, aiming to improve feasibility and efficiency under limited access.
△ Less
Submitted 15 September, 2025;
originally announced September 2025.
-
Deriving the Scaled-Dot-Function via Maximum Likelihood Estimation and Maximum Entropy Approach
Authors:
Jiyong Ma
Abstract:
In this paper, we present a maximum likelihood estimation approach to determine the value vector in transformer models. We model the sequence of value vectors, key vectors, and the query vector as a sequence of Gaussian distributions. The variance in each Gaussian distribution depends on the time step, the corresponding key vector, and the query vector. The mean value in each Gaussian distribution…
▽ More
In this paper, we present a maximum likelihood estimation approach to determine the value vector in transformer models. We model the sequence of value vectors, key vectors, and the query vector as a sequence of Gaussian distributions. The variance in each Gaussian distribution depends on the time step, the corresponding key vector, and the query vector. The mean value in each Gaussian distribution depends on the time step, and the corresponding value vector. This analysis may offer a new explanation of the scaled-dot-product function or softmax function used in transformer architectures [1]. Another explanation, inspired by [4], is based on the maximum entropy approach in natural language processing [5]. In this approach, a query vector and key vectors are used to derive the feature functions for the maximum entropy model.
△ Less
Submitted 14 September, 2025;
originally announced September 2025.
-
Artificial Intelligence in Breast Cancer Care: Transforming Preoperative Planning and Patient Education with 3D Reconstruction
Authors:
Mustafa Khanbhai,
Giulia Di Nardo,
Jun Ma,
Vivienne Freitas,
Caterina Masino,
Ali Dolatabadi,
Zhaoxun "Lorenz" Liu,
Wey Leong,
Wagner H. Souza,
Amin Madani
Abstract:
Effective preoperative planning requires accurate algorithms for segmenting anatomical structures across diverse datasets, but traditional models struggle with generalization. This study presents a novel machine learning methodology to improve algorithm generalization for 3D anatomical reconstruction beyond breast cancer applications. We processed 120 retrospective breast MRIs (January 2018-June 2…
▽ More
Effective preoperative planning requires accurate algorithms for segmenting anatomical structures across diverse datasets, but traditional models struggle with generalization. This study presents a novel machine learning methodology to improve algorithm generalization for 3D anatomical reconstruction beyond breast cancer applications. We processed 120 retrospective breast MRIs (January 2018-June 2023) through three phases: anonymization and manual segmentation of T1-weighted and dynamic contrast-enhanced sequences; co-registration and segmentation of whole breast, fibroglandular tissue, and tumors; and 3D visualization using ITK-SNAP. A human-in-the-loop approach refined segmentations using U-Mamba, designed to generalize across imaging scenarios. Dice similarity coefficient assessed overlap between automated segmentation and ground truth. Clinical relevance was evaluated through clinician and patient interviews. U-Mamba showed strong performance with DSC values of 0.97 ($\pm$0.013) for whole organs, 0.96 ($\pm$0.024) for fibroglandular tissue, and 0.82 ($\pm$0.12) for tumors on T1-weighted images. The model generated accurate 3D reconstructions enabling visualization of complex anatomical features. Clinician interviews indicated improved planning, intraoperative navigation, and decision support. Integration of 3D visualization enhanced patient education, communication, and understanding. This human-in-the-loop machine learning approach successfully generalizes algorithms for 3D reconstruction and anatomical segmentation across patient datasets, offering enhanced visualization for clinicians, improved preoperative planning, and more effective patient education, facilitating shared decision-making and empowering informed patient choices across medical applications.
△ Less
Submitted 10 September, 2025;
originally announced September 2025.
-
Bridging the Gap Between Sparsity and Redundancy: A Dual-Decoding Framework with Global Context for Map Inference
Authors:
Yudong Shen,
Wenyu Wu,
Jiali Mao,
Yixiao Tong,
Guoping Liu,
Chaoya Wang
Abstract:
Trajectory data has become a key resource for automated map in-ference due to its low cost, broad coverage, and continuous availability. However, uneven trajectory density often leads to frag-mented roads in sparse areas and redundant segments in dense regions, posing significant challenges for existing methods. To address these issues, we propose DGMap, a dual-decoding framework with global conte…
▽ More
Trajectory data has become a key resource for automated map in-ference due to its low cost, broad coverage, and continuous availability. However, uneven trajectory density often leads to frag-mented roads in sparse areas and redundant segments in dense regions, posing significant challenges for existing methods. To address these issues, we propose DGMap, a dual-decoding framework with global context awareness, featuring Multi-scale Grid Encoding, Mask-enhanced Keypoint Extraction, and Global Context-aware Relation Prediction. By integrating global semantic context with local geometric features, DGMap improves keypoint detection accuracy to reduce road fragmentation in sparse-trajectory areas. Additionally, the Global Context-aware Relation Prediction module suppresses false connections in dense-trajectory regions by modeling long-range trajectory patterns. Experimental results on three real-world datasets show that DGMap outperforms state-of-the-art methods by 5% in APLS, with notable performance gains on trajectory data from the Didi Chuxing platform
△ Less
Submitted 15 September, 2025;
originally announced September 2025.