Showing 1–50 of 1,903 results for author: Mao, J

Searching in archive cs.

  1. arXiv:2510.00732

    cs.AI

    EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty

    Authors: Yuchen Tian, Ruiyuan Huang, Xuanwu Wang, Jing Ma, Zengfeng Huang, Ziyang Luo, Hongzhan Lin, Da Zheng, Lun Du

    Abstract: Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To address this limitation, we introduce a novel data augmentation pipeline designed to enhance model robustness from two perspectives: symmetry and difficulty. From the symmetry perspective, we propose two…

    Submitted 1 October, 2025; originally announced October 2025.

  2. arXiv:2510.00053

    eess.IV cs.CV cs.LG

    DPsurv: Dual-Prototype Evidential Fusion for Uncertainty-Aware and Interpretable Whole-Slide Image Survival Prediction

    Authors: Yucheng Xing, Ling Huang, Jingying Ma, Ruping Hong, Jiangdong Qiu, Pei Liu, Kai He, Huazhu Fu, Mengling Feng

    Abstract: Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty i…

    Submitted 28 September, 2025; originally announced October 2025.

  3. arXiv:2510.00040

    cs.CV cs.AI

    Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

    Authors: Junjie Li, Ziao Wang, Jianghong Ma, Xiaofeng Zhang

    Abstract: Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the budget of instruction tuning dataset often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a fra…

    Submitted 26 September, 2025; originally announced October 2025.

  4. arXiv:2509.25852

    cs.RO

    Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation

    Authors: Zitong Bo, Yue Hu, Jinming Ma, Mingliang Zhou, Junhui Yin, Yachen Kang, Yuqi Liu, Tong Wu, Diyun Xiang, Hao Chen

    Abstract: Enabling robots to execute long-horizon manipulation tasks from free-form language instructions remains a fundamental challenge in embodied AI. While vision-language models (VLMs) have shown promise as high-level planners, their deployment in the real world is hindered by two gaps: (i) the scarcity of large-scale, sequential manipulation data that couples natural language with multi-step action pl…

    Submitted 30 September, 2025; originally announced September 2025.

  5. arXiv:2509.25742

    cs.LG

    Less is More: Towards Simple Graph Contrastive Learning

    Authors: Yanan Zhao, Feng Ji, Jingyang Dai, Jiaze Ma, Wee Peng Tay

    Abstract: Graph Contrastive Learning (GCL) has shown strong promise for unsupervised graph representation learning, yet its effectiveness on heterophilic graphs, where connected nodes often belong to different classes, remains limited. Most existing methods rely on complex augmentation schemes, intricate encoders, or negative sampling, which raises the question of whether such complexity is truly necessary…

    Submitted 29 September, 2025; originally announced September 2025.

    Comments: Submitted to ICLR 2026

  6. arXiv:2509.25699

    cs.CV

    AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

    Authors: Xiping Li, Jianghong Ma

    Abstract: Multimodal Chain-of-Thought (CoT) has emerged as a powerful technique for enhancing the vision-language reasoning with interleaved information. However, existing methods often rely on simplistic heuristics for constructing interleaved CoT, typically depending on attention maps, which our empirical analysis reveals can be unreliable. What's more, the shortcomings of their passive and purposeless se…

    Submitted 29 September, 2025; originally announced September 2025.

    Comments: 22 pages, 4 figures, submitted to ICLR 2026

  7. arXiv:2509.25540

    cs.AI

    RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale

    Authors: Jason Holmes, Yuexing Hao, Mariana Borras-Osorio, Federico Mastroleo, Santiago Romero Brufau, Valentina Carducci, Katie M Van Abel, David M Routman, Andrew Y. K. Foong, Liv M Muller, Satomi Shiraishi, Daniel K Ebner, Daniel J Ma, Sameer R Keole, Samir H Patel, Mirek Fatyga, Martin Bues, Brad J Stish, Yolanda I Garces, Michelle A Neben Wittich, Robert L Foote, Sujay A Vora, Nadia N Laack, Mark R Waddle, Wei Liu

    Abstract: Manual labeling limits the scale, accuracy, and timeliness of patient outcomes research in radiation oncology. We present RadOnc-GPT, an autonomous large language model (LLM)-based agent capable of independently retrieving patient-specific information, iteratively assessing evidence, and returning structured outcomes. Our evaluation explicitly validates RadOnc-GPT across two clearly defined tiers…

    Submitted 29 September, 2025; originally announced September 2025.

  8. arXiv:2509.24377

    cs.AI

    Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs

    Authors: Shihao Qi, Jie Ma, Ziang Yin, Lingling Zhang, Jian Zhang, Jun Liu, Feng Tian, Tongliang Liu

    Abstract: Existing methods usually leverage a fixed strategy, such as natural language reasoning, code-augmented reasoning, tool-integrated reasoning, or ensemble-based reasoning, to guide Large Language Models (LLMs) to perform mathematical reasoning. Our analysis reveals that the single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and effici…

    Submitted 29 September, 2025; originally announced September 2025.

  9. arXiv:2509.24351

    cs.AI

    From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision

    Authors: Jie Ma, Shihao Qi, Rui Xing, Ziang Yin, Bifan Wei, Jun Liu, Tongliang Liu

    Abstract: The quality of process data plays a key role in training a Process Reward Model (PRM), which can enhance the complex mathematical reasoning capability of large language models. Existing methods estimate the quality of reasoning steps based on a fixed-budget sampling strategy and navigate a vast search space to perform path expansion during the automated data generation process, resulting in their…

    Submitted 29 September, 2025; originally announced September 2025.

  10. arXiv:2509.23468

    cs.RO cs.AI cs.LG

    Multi-Modal Manipulation via Multi-Modal Policy Consensus

    Authors: Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Driggs-Campbell

    Abstract: Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes…

    Submitted 27 September, 2025; originally announced September 2025.

    Comments: 9 pages, 7 figures

  11. arXiv:2509.23308

    cs.RO

    Distributed Multi-Robot Multi-Target Simultaneous Search and Tracking in an Unknown Non-convex Environment

    Authors: Jun Chen, Jiaqing Ma, Philip Dames

    Abstract: In unknown non-convex environments, such as indoor and underground spaces, deploying a fleet of robots to explore the surroundings while simultaneously searching for and tracking targets of interest to maintain high-precision data collection represents a fundamental challenge that urgently requires resolution in applications such as environmental monitoring and rescue operations. Current research…

    Submitted 27 September, 2025; originally announced September 2025.

  12. arXiv:2509.23261

    cs.SE

    The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution

    Authors: Fei Gu, Zi Liang, Hongzong LI, Jiahao MA

    Abstract: AI-assisted programming is rapidly reshaping software development, with large language models (LLMs) enabling new paradigms such as vibe coding and agentic coding. While prior works have focused on prompt design and code generation quality, the broader impact of LLM-driven development on the iterative dynamics of software engineering remains underexplored. In this paper, we conduct large-scale exp…

    Submitted 27 September, 2025; originally announced September 2025.

  13. arXiv:2509.23254

    cs.LG q-bio.BM

    ABConformer: Physics-inspired Sliding Attention for Antibody-Antigen Interface Prediction

    Authors: Zhang-Yu You, Jiahao Ma, Hongzong Li, Ye-Fan Hu, Jian-Dong Huang

    Abstract: Accurate prediction of antibody-antigen (Ab-Ag) interfaces is critical for vaccine design, immunodiagnostics, and therapeutic antibody development. However, achieving reliable predictions from sequences alone remains a challenge. In this paper, we present ABCONFORMER, a model based on the Conformer backbone that captures both local and global features of a biosequence. To accurately capture Ab-Ag…

    Submitted 27 September, 2025; originally announced September 2025.

  14. arXiv:2509.22970

    cs.RO cs.CV cs.LG

    Robot Learning from Any Images

    Authors: Siheng Zhao, Jiageng Mao, Wei Chow, Zeyu Shangguan, Tianheng Shi, Rong Xue, Yuxi Zheng, Yijia Weng, Yang You, Daniel Seita, Leonidas Guibas, Sergey Zakharov, Vitor Guizilini, Yue Wang

    Abstract: We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image so…

    Submitted 26 September, 2025; originally announced September 2025.

    Comments: CoRL 2025 camera ready

  15. arXiv:2509.22550

    cs.RO

    An Intention-driven Lane Change Framework Considering Heterogeneous Dynamic Cooperation in Mixed-traffic Environment

    Authors: Xiaoyun Qiu, Haichao Liu, Yue Pan, Jun Ma, Xinhu Zheng

    Abstract: In mixed-traffic environments, where autonomous vehicles (AVs) interact with diverse human-driven vehicles (HVs), unpredictable intentions and heterogeneous behaviors make safe and efficient lane change maneuvers highly challenging. Existing methods often oversimplify these interactions by assuming uniform patterns. We propose an intention-driven lane change framework that integrates driving-style…

    Submitted 26 September, 2025; originally announced September 2025.

  16. arXiv:2509.20739

    cs.RO cs.CV

    SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning

    Authors: Guoyang Zhao, Yudong Li, Weiqing Qi, Kai Zhang, Bonan Liu, Kai Chen, Haoang Li, Jun Ma

    Abstract: Conventional SLAM pipelines for legged robot navigation are fragile under rapid motion, calibration demands, and sensor drift, while offering limited semantic reasoning for task-driven exploration. To deal with these issues, we propose a vision-only, SLAM-free navigation framework that replaces dense geometry with semantic reasoning and lightweight topological representations. A hierarchical visio…

    Submitted 25 September, 2025; originally announced September 2025.

  17. arXiv:2509.20271

    cs.CV

    A Versatile Foundation Model for AI-enabled Mammogram Interpretation

    Authors: Fuxiang Huang, Jiayi Zhu, Yunfang Yu, Yu Xie, Yuan Guo, Qingcong Kong, Mingxiang Wu, Xinrui Jiang, Shu Yang, Jiabo Ma, Ziyi Liu, Zhe Xu, Zhixuan Chen, Yujie Tan, Zifan He, Luhui Mao, Xi Wang, Junlin Hou, Lei Zhang, Qiong Luo, Zhenhui Li, Herui Yao, Hao Chen

    Abstract: Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in tra…

    Submitted 24 September, 2025; originally announced September 2025.

    Comments: 64 pages, 7 figures, 40 tables

  18. arXiv:2509.20036

    cs.RO

    MARG: MAstering Risky Gap Terrains for Legged Robots with Elevation Mapping

    Authors: Yinzhao Dong, Ji Ma, Liu Zhao, Wanyue Li, Peng Lu

    Abstract: Deep Reinforcement Learning (DRL) controllers for quadrupedal locomotion have demonstrated impressive performance on challenging terrains, allowing robots to execute complex skills such as climbing, running, and jumping. However, existing blind locomotion controllers often struggle to ensure safety and efficient traversal through risky gap terrains, which are typically highly complex, requiring ro…

    Submitted 27 September, 2025; v1 submitted 24 September, 2025; originally announced September 2025.

  19. arXiv:2509.19973

    cs.CV

    OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

    Authors: Pei Liu, Hongliang Lu, Haichao Liu, Haipeng Liu, Xin Liu, Ruoyu Yao, Shengbo Eben Li, Jun Ma

    Abstract: Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene under…

    Submitted 25 September, 2025; v1 submitted 24 September, 2025; originally announced September 2025.

  20. arXiv:2509.19916

    cs.RO

    GUIDE: A Diffusion-Based Autonomous Robot Exploration Framework Using Global Graph Inference

    Authors: Zijun Che, Yinghong Zhang, Shengyi Liang, Boyu Zhou, Jun Ma, Jinni Zhou

    Abstract: Autonomous exploration in structured and complex indoor environments remains a challenging task, as existing methods often struggle to appropriately model unobserved space and plan globally efficient paths. To address these limitations, we propose GUIDE, a novel exploration framework that synergistically combines global graph inference with diffusion-based decision-making. We introduce a region-ev…

    Submitted 25 September, 2025; v1 submitted 24 September, 2025; originally announced September 2025.

  21. arXiv:2509.19853

    cs.RO

    SAGE: State-Aware Guided End-to-End Policy for Multi-Stage Sequential Tasks via Hidden Markov Decision Process

    Authors: BinXu Wu, TengFei Zhang, Chen Yang, JiaHao Wen, HaoCheng Li, JingTian Ma, Zhen Chen, JingYuan Wang

    Abstract: Multi-stage sequential (MSS) robotic manipulation tasks are prevalent and crucial in robotics. They often involve state ambiguity, where visually similar observations correspond to different actions. We present SAGE, a state-aware guided imitation learning framework that models tasks as a Hidden Markov Decision Process (HMDP) to explicitly capture latent task stages and resolve ambiguity. We insta…

    Submitted 24 September, 2025; originally announced September 2025.

  22. arXiv:2509.19452

    cs.RO cs.CV cs.LG

    HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

    Authors: Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno

    Abstract: Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but…

    Submitted 28 September, 2025; v1 submitted 23 September, 2025; originally announced September 2025.

  23. arXiv:2509.18968

    cs.LG

    Otters: An Energy-Efficient SpikingTransformer via Optical Time-to-First-Spike Encoding

    Authors: Zhanglu Yan, Jiayi Mao, Qianhui Liu, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Weng-Fai Wong

    Abstract: Spiking neural networks (SNNs) promise high energy efficiency, particularly with time-to-first-spike (TTFS) encoding, which maximizes sparsity by emitting at most one spike per neuron. However, such energy advantage is often unrealized because inference requires evaluating a temporal decay function and subsequent multiplication with the synaptic weights. This paper challenges this costly approach…

    Submitted 23 September, 2025; originally announced September 2025.

  24. arXiv:2509.18948

    cs.GR cs.CV

    One-shot Embroidery Customization via Contrastive LoRA Modulation

    Authors: Jun Ma, Qian He, Gaofeng He, Huang Chen, Chen Liu, Xiaogang Jin, Huamin Wang

    Abstract: Diffusion models have significantly advanced image manipulation techniques, and their ability to generate photorealistic images is beginning to transform retail workflows, particularly in presale visualization. Beyond artistic style transfer, the capability to perform fine-grained visual feature transfer is becoming increasingly important. Embroidery is a textile art form characterized by intricat…

    Submitted 23 September, 2025; originally announced September 2025.

    Comments: Accepted to ACM Transactions on Graphics (TOG), SIGGRAPH Asia 2025

  25. arXiv:2509.18362

    cs.LG cs.AI

    FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

    Authors: Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen

    Abstract: As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This pap…

    Submitted 16 September, 2025; originally announced September 2025.

  26. arXiv:2509.18284

    cs.CV

    Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction

    Authors: Yi Gu, Kuniaki Saito, Jiaxin Ma

    Abstract: As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modalities dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. O…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: MICCAI 2025

  27. arXiv:2509.18189

    cs.CV cs.AI

    Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

    Authors: Daxiang Dong, Mingming Zheng, Dong Xu, Bairong Zhuang, Wenyu Zhang, Chunhua Luo, Haoran Wang, Zijian Zhao, Jie Li, Yuxuan Li, Hanjun Zhong, Mengyue Liu, Jieting Chen, Shupeng Li, Lun Tian, Yaping Feng, Xin Li, Donggang Jiang, Yong Chen, Yehua Xu, Duohao Qin, Chen Feng, Dan Wang, Henghua Zhang, Jingjing Ha , et al. (10 additional authors not shown)

    Abstract: We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong g…

    Submitted 19 September, 2025; originally announced September 2025.

    Comments: 12 pages

  28. arXiv:2509.17850

    cs.RO

    SocialTraj: Two-Stage Socially-Aware Trajectory Prediction for Autonomous Driving via Conditional Diffusion Model

    Authors: Xiao Zhou, Zengqi Peng, Jun Ma

    Abstract: Accurate trajectory prediction of surrounding vehicles (SVs) is crucial for autonomous driving systems to avoid misguided decisions and potential accidents. However, achieving reliable predictions in highly dynamic and complex traffic scenarios remains a significant challenge. One of the key impediments lies in the limited effectiveness of current approaches to capture the multi-modal behaviors of…

    Submitted 22 September, 2025; originally announced September 2025.

  29. arXiv:2509.17537

    cs.CV

    SimToken: A Simple Baseline for Referring Audio-Visual Segmentation

    Authors: Dian Jin, Yanghao Zhou, Jinxing Zhou, Jiaqi Ma, Ruohao Guo, Dan Guo

    Abstract: Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Se…

    Submitted 23 September, 2025; v1 submitted 22 September, 2025; originally announced September 2025.

    Comments: Project page: https://github.com/DianJin-HFUT/SimToken

  30. arXiv:2509.17429

    cs.CV

    Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

    Authors: Zhitao Zeng, Guojian Yuan, Junyuan Mao, Yuxuan Wang, Xiaoshuang Jia, Yueming Jin

    Abstract: Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimension…

    Submitted 23 September, 2025; v1 submitted 22 September, 2025; originally announced September 2025.

    Comments: 20 pages, 6 figures

    MSC Class: 68T45

    ACM Class: I.2.10

    Journal ref: NeurIPS 2025

  31. Computational Scaffolding of Composition, Value, and Color for Disciplined Drawing

    Authors: Jiaju Ma, Chau Vu, Asya Lyubavina, Catherine Liu, Jingyi Li

    Abstract: One way illustrators engage in disciplined drawing - the process of drawing to improve technical skills - is through studying and replicating reference images. However, for many novice and intermediate digital artists, knowing how to approach studying a reference image can be challenging. It can also be difficult to receive immediate feedback on their works-in-progress. To help these users develop…

    Submitted 21 September, 2025; originally announced September 2025.

    Comments: Accepted to UIST 2025 (Best Paper)

  32. arXiv:2509.17080

    cs.RO

    CoPlanner: An Interactive Motion Planner with Contingency-Aware Diffusion for Autonomous Driving

    Authors: Ruiguo Zhong, Ruoyu Yao, Pei Liu, Xiaolong Chen, Rui Yang, Jun Ma

    Abstract: Accurate trajectory prediction and motion planning are crucial for autonomous driving systems to navigate safely in complex, interactive environments characterized by multimodal uncertainties. However, current generation-then-evaluation frameworks typically construct multiple plausible trajectory hypotheses but ultimately adopt a single most likely outcome, leading to overconfident decisions and a…

    Submitted 21 September, 2025; originally announced September 2025.

  33. arXiv:2509.17042

    cs.RO

    Orchestrate, Generate, Reflect: A VLM-Based Multi-Agent Collaboration Framework for Automated Driving Policy Learning

    Authors: Zengqi Peng, Yusen Xie, Yubin Wang, Rui Yang, Qifeng Chen, Jun Ma

    Abstract: The advancement of foundation models fosters new initiatives for policy learning in achieving safe and efficient autonomous driving. However, a critical bottleneck lies in the manual engineering of reward functions and training curricula for complex and dynamic driving tasks, which is a labor-intensive and time-consuming process. To address this problem, we propose OGR (Orchestrate, Generate, Refl…

    Submitted 21 September, 2025; originally announced September 2025.

  34. arXiv:2509.16017

    cs.CV

    DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

    Authors: Meng Yang, Fan Fan, Zizhuo Li, Songchu Deng, Yong Ma, Jiayi Ma

    Abstract: Multimodal image matching seeks pixel-level correspondences between images of different modalities, crucial for cross-modal perception, fusion and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform po…

    Submitted 19 September, 2025; originally announced September 2025.

    Comments: 10 pages, 4 figures, 3 tables

    ACM Class: I.4.3; I.5.2

  35. arXiv:2509.15492

    cs.SD cs.MM eess.AS

    Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech

    Authors: Xinlei Niu, Jianbo Ma, Dylan Harper-Harris, Xiangyu Zhang, Charles Patrick Martin, Jing Zhang

    Abstract: The generation of realistic, context-aware audio is important in real-world applications such as video game development. While existing video-to-audio (V2A) methods mainly focus on Foley sound generation, they struggle to produce intelligible speech. Meanwhile, current environmental speech synthesis approaches remain text-driven and fail to temporally align with dynamic video content. In this pape…

    Submitted 18 September, 2025; originally announced September 2025.

  36. arXiv:2509.14999

    cs.RO

    Semantic-LiDAR-Inertial-Wheel Odometry Fusion for Robust Localization in Large-Scale Dynamic Environments

    Authors: Haoxuan Jiang, Peicong Qian, Yusen Xie, Linwei Zheng, Xiaocong Li, Ming Liu, Jun Ma

    Abstract: Reliable, drift-free global localization presents significant challenges yet remains crucial for autonomous navigation in large-scale dynamic environments. In this paper, we introduce a tightly-coupled Semantic-LiDAR-Inertial-Wheel Odometry fusion framework, which is specifically designed to provide high-precision state estimation and robust localization in large-scale dynamic environments. Our fr…

    Submitted 18 September, 2025; originally announced September 2025.

  37. arXiv:2509.14242

    eess.SP cs.LG

    Artificial Intelligence-derived Cardiotocography Age as a Digital Biomarker for Predicting Future Adverse Pregnancy Outcomes

    Authors: Jinshuai Gu, Zenghui Lin, Jingying Ma, Jingyu Wang, Linyan Zhang, Rui Bai, Zelin Tu, Youyou Jiang, Donglin Xie, Yuxi Zhou, Guoli Liu, Shenda Hong

    Abstract: Cardiotocography (CTG) is a low-cost, non-invasive fetal health assessment technique used globally, especially in underdeveloped countries. However, it is currently mainly used to identify the fetus's current status (e.g., fetal acidosis or hypoxia), and the potential of CTG in predicting future adverse pregnancy outcomes has not been fully explored. We aim to develop an AI-based model that predic…

    Submitted 3 September, 2025; originally announced September 2025.

  38. arXiv:2509.14142

    cs.CV

    MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

    Authors: Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang, Junming Liang, Guoyou Li, Zhaoxiang Wang , et al. (103 additional authors not shown)

    Abstract: This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's…

    Submitted 17 September, 2025; originally announced September 2025.

    Comments: ICCV 2025 MARS2 Workshop and Challenge "Multimodal Reasoning and Slow Thinking in the Large Model Era: Towards System 2 and Beyond"

  39. arXiv:2509.14119

    cs.CV

    Generative AI for Misalignment-Resistant Virtual Staining to Accelerate Histopathology Workflows

    Authors: Jiabo MA, Wenqiang Li, Jinbang Li, Ziyi Liu, Linshan Wu, Fengtao Zhou, Li Liang, Ronald Cheong Kin Chan, Terence T. W. Wong, Hao Chen

    Abstract: Accurate histopathological diagnosis often requires multiple differently stained tissue sections, a process that is time-consuming, labor-intensive, and environmentally taxing due to the use of multiple chemical stains. Recently, virtual staining has emerged as a promising alternative that is faster, tissue-conserving, and environmentally friendly. However, existing virtual staining methods face s…

    Submitted 17 September, 2025; originally announced September 2025.

    Comments: the arxiv version of the under review journal paper

  40. arXiv:2509.13956

    cs.RO

    SEG-Parking: Towards Safe, Efficient, and Generalizable Autonomous Parking via End-to-End Offline Reinforcement Learning

    Authors: Zewei Yang, Zengqi Peng, Jun Ma

    Abstract: Autonomous parking is a critical component for achieving safe and efficient urban autonomous driving. However, unstructured environments and dynamic interactions pose significant challenges to autonomous parking tasks. To address this problem, we propose SEG-Parking, a novel end-to-end offline reinforcement learning (RL) framework to achieve interaction-aware autonomous parking. Notably, a special…

    Submitted 17 September, 2025; originally announced September 2025.

  41. arXiv:2509.13955

    cs.IT eess.SP

    Asymptotic Analysis of Nonlinear One-Bit Precoding in Massive MIMO Systems via Approximate Message Passing

    Authors: Zheyu Wu, Junjie Ma, Ya-Feng Liu, Bruno Clerckx

    Abstract: Massive multiple-input multiple-output (MIMO) systems employing one-bit digital-to-analog converters offer a hardware-efficient solution for wireless communications. However, the one-bit constraint poses significant challenges for precoding design, as it transforms the problem into a discrete and nonconvex optimization task. In this paper, we investigate a widely adopted "convex-relaxation-then-q…

    Submitted 17 September, 2025; originally announced September 2025.

    Comments: 39 pages, 6 figures, submitted for possible publication

  42. arXiv:2509.13761  [pdf, ps, other]

    cs.AI cs.CL

    THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

    Authors: Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Jun Du, Quan Liu, Jianqing Gao

    Abstract: Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reason…

    Submitted 17 September, 2025; originally announced September 2025.

    Comments: 22 pages, 13 figures

  43. arXiv:2509.13534  [pdf, ps, other]

    cs.RO

    Embracing Bulky Objects with Humanoid Robots: Whole-Body Manipulation with Reinforcement Learning

    Authors: Chunxin Zheng, Kai Chen, Zhihai Bi, Yulin Li, Liang Pan, Jinni Zhou, Haoang Li, Jun Ma

    Abstract: Whole-body manipulation (WBM) for humanoid robots presents a promising approach for executing embracing tasks involving bulky objects, where traditional grasping that relies only on end-effectors remains limited by inherent stability and payload constraints. This paper introduces a reinforcement learning framework that integrates a pre-trained human motion prior with a neural sig…

    Submitted 16 September, 2025; originally announced September 2025.

  44. arXiv:2509.12676  [pdf, ps, other]

    cs.AR cs.CR

    A Scalable Architecture for Efficient Multi-bit Fully Homomorphic Encryption

    Authors: Jiaao Ma, Ceyu Xu, Lisa Wu Wills

    Abstract: In the era of cloud computing, privacy-preserving computation offloading is crucial for safeguarding sensitive data. Fully Homomorphic Encryption (FHE) enables secure processing of encrypted data, but the inherent computational complexity of FHE operations introduces significant overhead on the server side. FHE schemes often face a tradeoff between efficiency and versatility. While t…

    Submitted 16 September, 2025; originally announced September 2025.

    Comments: 13 pages, 16 figures

    ACM Class: C.3; E.3; C.1

  45. arXiv:2509.12630  [pdf, ps, other]

    cs.LG

    High-Energy Concentration for Federated Learning in Frequency Domain

    Authors: Haozhi Shi, Weiying Xie, Hangyu Ye, Daixun Li, Jitao Ma, Leyuan Fang

    Abstract: Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. By leveraging the popular concept of dataset distillation and sending synthetic data to the server, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain de…

    Submitted 15 September, 2025; originally announced September 2025.

  46. arXiv:2509.12600  [pdf, ps, other]

    cs.LG cs.AI q-bio.QM

    A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction

    Authors: Huajun Zhou, Fengtao Zhou, Jiabo Ma, Yingxue Xu, Xi Wang, Xiuming Zhang, Li Liang, Zhenhui Li, Hao Chen

    Abstract: Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates patho…

    Submitted 15 September, 2025; originally announced September 2025.

    Comments: 27 pages, 7 figures

  47. arXiv:2509.12581  [pdf, ps, other]

    cs.LG

    Exploring Training Data Attribution under Limited Access Constraints

    Authors: Shiyuan Zhang, Junwei Deng, Juhan Bae, Jiaqi Ma

    Abstract: Training data attribution (TDA) plays a critical role in understanding the influence of individual training data points on model predictions. Gradient-based TDA methods, popularized by influence functions for their superior performance, have been widely applied in data selection, data cleaning, data economics, and fact tracing. However, in real-world scenarios where commercial models are n…

    Submitted 15 September, 2025; originally announced September 2025.

  48. arXiv:2509.12285  [pdf]

    cs.LG cs.AI

    Deriving the Scaled-Dot-Function via Maximum Likelihood Estimation and Maximum Entropy Approach

    Authors: Jiyong Ma

    Abstract: In this paper, we present a maximum likelihood estimation approach to determine the value vector in transformer models. We model the sequence of value vectors, key vectors, and the query vector as a sequence of Gaussian distributions. The variance in each Gaussian distribution depends on the time step, the corresponding key vector, and the query vector. The mean value in each Gaussian distribution…

    Submitted 14 September, 2025; originally announced September 2025.

  49. arXiv:2509.12242  [pdf, ps, other]

    cs.CV

    Artificial Intelligence in Breast Cancer Care: Transforming Preoperative Planning and Patient Education with 3D Reconstruction

    Authors: Mustafa Khanbhai, Giulia Di Nardo, Jun Ma, Vivienne Freitas, Caterina Masino, Ali Dolatabadi, Zhaoxun "Lorenz" Liu, Wey Leong, Wagner H. Souza, Amin Madani

    Abstract: Effective preoperative planning requires accurate algorithms for segmenting anatomical structures across diverse datasets, but traditional models struggle with generalization. This study presents a novel machine learning methodology to improve algorithm generalization for 3D anatomical reconstruction beyond breast cancer applications. We processed 120 retrospective breast MRIs (January 2018-June 2…

    Submitted 10 September, 2025; originally announced September 2025.

  50. arXiv:2509.11731  [pdf, ps, other]

    cs.CV cs.AI

    Bridging the Gap Between Sparsity and Redundancy: A Dual-Decoding Framework with Global Context for Map Inference

    Authors: Yudong Shen, Wenyu Wu, Jiali Mao, Yixiao Tong, Guoping Liu, Chaoya Wang

    Abstract: Trajectory data has become a key resource for automated map inference due to its low cost, broad coverage, and continuous availability. However, uneven trajectory density often leads to fragmented roads in sparse areas and redundant segments in dense regions, posing significant challenges for existing methods. To address these issues, we propose DGMap, a dual-decoding framework with global conte…

    Submitted 15 September, 2025; originally announced September 2025.