Showing 1–50 of 932 results for author: Xu, R

Searching in archive cs.
  1. arXiv:2507.05142  [pdf, ps, other]

    cs.AI

    GIST: Cross-Domain Click-Through Rate Prediction via Guided Content-Behavior Distillation

    Authors: Wei Xu, Haoran Li, Baoyuan Ou, Lai Xu, Yingjie Qin, Ruilong Su, Ruiwen Xu

    Abstract: Cross-domain Click-Through Rate prediction aims to tackle the data sparsity and the cold start problems in online advertising systems by transferring knowledge from source domains to a target domain. Most existing methods rely on overlapping users to facilitate this transfer, often focusing on joint training or pre-training with fine-tuning approach to connect the source and target domains. Howeve…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2507.04590  [pdf, ps, other]

    cs.CV cs.CL

    VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

    Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

    Abstract: Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicabilit…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: Technical Report

  3. arXiv:2507.04587  [pdf, ps, other]

    cs.CV

    CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection

    Authors: Hanzhi Zhong, Zhiyu Xiang, Ruoyu Xu, Jingyun Fu, Peng Xu, Shaohong Wang, Zhihao Yang, Tianyu Pu, Eryun Liu

    Abstract: 4D radar has received significant attention in autonomous driving thanks to its robustness under adverse weathers. Due to the sparse points and noisy measurements of the 4D radar, most of the research finish the 3D object detection task by integrating images from camera and perform modality fusion in BEV space. However, the potential of the radar and the fusion mechanism is still largely unexplore…

    Submitted 6 July, 2025; originally announced July 2025.

  4. arXiv:2507.02724  [pdf, ps, other]

    cs.LG q-bio.BM

    Hierarchical Multi-Label Contrastive Learning for Protein-Protein Interaction Prediction Across Organisms

    Authors: Shiyi Liu, Buwen Liang, Yuetong Fang, Zixuan Jiang, Renjing Xu

    Abstract: Recent advances in AI for science have highlighted the power of contrastive learning in bridging heterogeneous biological data modalities. Building on this paradigm, we propose HIPPO (HIerarchical Protein-Protein interaction prediction across Organisms), a hierarchical contrastive framework for protein-protein interaction(PPI) prediction, where protein sequences and their hierarchical attributes a…

    Submitted 3 July, 2025; originally announced July 2025.

  5. arXiv:2507.01485  [pdf, ps, other]

    cs.RO cs.AI cs.MA q-bio.QM

    BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments

    Authors: Yibo Qiu, Zan Huang, Zhiyu Wang, Handi Liu, Yiling Qiao, Yifeng Hu, Shu'ang Sun, Hangke Peng, Ronald X Xu, Mingzhai Sun

    Abstract: Large language models (LLMs) and vision-language models (VLMs) have the potential to transform biological research by enabling autonomous experimentation. Yet, their application remains constrained by rigid protocol design, limited adaptability to dynamic lab conditions, inadequate error handling, and high operational complexity. Here we introduce BioMARS (Biological Multi-Agent Robotic System), a…

    Submitted 2 July, 2025; originally announced July 2025.

  6. arXiv:2507.00880  [pdf, ps, other]

    cs.LG cs.AI

    NN-Former: Rethinking Graph Structure in Neural Architecture Representation

    Authors: Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang

    Abstract: The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each of both methods has its disadvantages. GNNs lack the capabilities to represent compli…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to CVPR 2025. Code is available at https://github.com/XuRuihan/NNFormer

  7. arXiv:2506.23351  [pdf, ps, other]

    cs.RO cs.AI cs.LG cs.MA

    Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop

    Authors: Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, Haibao Yu, Xiaokang Yang, Yusen Qin, Zhiqiang Xie, Yao Mu, Ping Luo, Tian Nian, Weiliang Deng, Yiheng Ge, Yibin Liu, Zixuan Li, Dehui Wang, Zhixuan Liang, Haohui Xie, Rijie Zeng, et al. (74 additional authors not shown)

    Abstract: Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To ad…

    Submitted 2 July, 2025; v1 submitted 29 June, 2025; originally announced June 2025.

    Comments: Challenge Webpage: https://robotwin-benchmark.github.io/cvpr-2025-challenge/

  8. arXiv:2506.23086  [pdf, ps, other]

    cs.CV

    Frequency-enhanced Multi-granularity Context Network for Efficient Vertebrae Segmentation

    Authors: Jian Shi, Tianqi You, Pingping Zhang, Hongli Zhang, Rui Xu, Haojie Li

    Abstract: Automated and accurate segmentation of individual vertebra in 3D CT and MRI images is essential for various clinical applications. Due to the limitations of current imaging techniques and the complexity of spinal structures, existing methods still struggle with reducing the impact of image blurring and distinguishing similar vertebrae. To alleviate these issues, we introduce a Frequency-enhanced M…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted by MICCAI2025. More modifications may be performed

  9. arXiv:2506.22283  [pdf, ps, other]

    cs.CV

    Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment

    Authors: Rui Xu, Yunke Wang, Yong Luo, Bo Du

    Abstract: Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the lar…

    Submitted 27 June, 2025; originally announced June 2025.

  10. arXiv:2506.21414  [pdf, ps, other]

    cs.AR

    Accelerating GNN Training through Locality-aware Dropout and Merge

    Authors: Gongjian Sun, Mingyu Yan, Dengke Han, Runzhen Xue, Duo Wang, Xiaochun Ye, Dongrui Fan

    Abstract: Graph Neural Networks (GNNs) have demonstrated significant success in graph learning and are widely adopted across various critical domains. However, the irregular connectivity between vertices leads to inefficient neighbor aggregation, resulting in substantial irregular and coarse-grained DRAM accesses. This lack of data locality presents significant challenges for execution platforms, ultimately…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Under review at TPDS. Extended version of the DATE 2025 paper

  11. arXiv:2506.20702  [pdf]

    cs.AI cs.CY

    The Singapore Consensus on Global AI Safety Research Priorities

    Authors: Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, Max Tegmark, Lan Xue, Ya-Qin Zhang, Stephen Casper, Wan Sie Lee, Sören Mindermann, Vanessa Wilfred, Vidhisha Balachandran, Fazl Barez, Michael Belinsky, Imane Bello, Malo Bourgon, Mark Brakel, Siméon Campos, Duncan Cass-Beggs, Jiahao Chen, Rumman Chowdhury, Kuan Chua Seah, Jeff Clune, Juntao Dai, et al. (63 additional authors not shown)

    Abstract: Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on…

    Submitted 30 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

    Comments: Final report from the "2025 Singapore Conference on AI (SCAI)" held April 26: https://www.scai.gov.sg/2025/scai2025-report

  12. arXiv:2506.16784  [pdf, ps, other]

    cs.CV cs.MM

    TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration

    Authors: Xiaoyu Shi, Rahul Kumar Jain, Yinhao Li, Ruibo Hou, Jingliang Cheng, Jie Bai, Guohua Zhao, Lanfen Lin, Rui Xu, Yen-wei Chen

    Abstract: Deep learning has demonstrated remarkable success in medical image segmentation and computer-aided diagnosis. In particular, numerous advanced methods have achieved state-of-the-art performance in brain tumor segmentation from MRI scans. While recent studies in other medical imaging domains have revealed that integrating textual reports with visual data can enhance segmentation accuracy, the field…

    Submitted 23 June, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

  13. arXiv:2506.14813  [pdf, ps, other]

    cs.LG cs.AI

    Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks

    Authors: Yuxuan Jiang, Ziming Zhou, Boyu Xu, Beijie Liu, Runhui Xu, Peng Huang

    Abstract: Training deep learning (DL) models is a complex process, making it prone to silent errors that are challenging to detect and diagnose. This paper presents TRAINCHECK, a framework that takes a proactive checking approach to address silent training errors. TRAINCHECK automatically infers invariants tailored for DL training. It uses these invariants to proactively detect silent errors during the trai…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: 19 pages, to appear in 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI '25)

  14. arXiv:2506.12594  [pdf, other]

    cs.AI cs.MA

    A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications

    Authors: Renjun Xu, Jingwen Peng

    Abstract: This survey examines the rapidly evolving field of Deep Research systems -- AI-powered applications that automate complex research workflows through the integration of large language models, advanced information retrieval, and autonomous reasoning capabilities. We analyze more than 80 commercial and non-commercial implementations that have emerged since 2023, including OpenAI/Deep Research, Gemini…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: 95 pages, 11 figures

    ACM Class: I.2.8

  15. arXiv:2506.12103  [pdf, other]

    cs.AI cs.CY cs.LG

    The Amazon Nova Family of Models: Technical Report and Model Card

    Authors: Amazon AGI, Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, Abner Pena, Achin Jain, Adam Kunysz, Adam Opyrchal, Adarsh Singh, Aditya Rawal, Adok Achar Budihal Prasad, Adrià de Gispert, Agnika Kumar, Aishwarya Aryamane, Ajay Nair, Akilan M, Akshaya Iyengar, Akshaya Vishnu Kudlu Shanbhogue, et al. (761 additional authors not shown)

    Abstract: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents…

    Submitted 17 March, 2025; originally announced June 2025.

    Comments: 48 pages, 10 figures

    Report number: 20250317

  16. arXiv:2506.11153  [pdf, ps, other]

    cs.SE cs.LG

    Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

    Authors: Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen

    Abstract: The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for the automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 28 pages

  17. arXiv:2506.11003  [pdf, other]

    cs.SE cs.AI

    EmbedAgent: Benchmarking Large Language Models in Embedded System Development

    Authors: Ruiyang Xu, Jialun Cao, Mingyuan Wu, Wenliang Zhong, Yaojie Lu, Ben He, Xianpei Han, Shing-Chi Cheung, Le Sun

    Abstract: Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system development. In this paper, we introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development, such as Embedded System Programmer, Architect, and Integrator. This paradigm enables LLMs to be tested in tasks that bridge the gap betwe…

    Submitted 19 April, 2025; originally announced June 2025.

    Comments: 21 pages

  18. arXiv:2506.09928  [pdf, ps, other]

    cs.LG stat.ML

    Course Project Report: Comparing MCMC and Variational Inference for Bayesian Probabilistic Matrix Factorization on the MovieLens Dataset

    Authors: Ruixuan Xu, Xiangxiang Weng

    Abstract: This is a course project report with complete methodology, experiments, references and mathematical derivations. Matrix factorization [1] is a widely used technique in recommendation systems. Probabilistic Matrix Factorization (PMF) [2] extends traditional matrix factorization by incorporating probability distributions over latent factors, allowing for uncertainty quantification. However, computin…

    Submitted 12 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: 11 pages, 2 figures. This document is a course project report. Some derivations are presented in a simplified form. For more detailed discussions and comprehensive proofs, please refer to the references cited in this report. v2 replacement: we have modified the title to better match our content. We have also updated the references to be more complete, including the link to our code

  19. arXiv:2506.08949  [pdf, ps, other]

    cs.CV

    SSS: Semi-Supervised SAM-2 with Efficient Prompting for Medical Imaging Segmentation

    Authors: Hongjie Zhu, Xiwei Liu, Rundong Xue, Zeyu Zhang, Yong Xu, Daji Ergu, Ying Cai, Yang Zhao

    Abstract: In the era of information explosion, efficiently leveraging large-scale unlabeled data while minimizing the reliance on high-quality pixel-level annotations remains a critical challenge in the field of medical imaging. Semi-supervised learning (SSL) enhances the utilization of unlabeled data by facilitating knowledge transfer, significantly improving the performance of fully supervised models and…

    Submitted 10 June, 2025; originally announced June 2025.

  20. arXiv:2506.08708  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly

    Authors: Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang, Bingqian Lin, Jun Ma, Yongxin Wang, Ziming Wei, Haokun Lin, Mingfei Han, Meng Cao, Bokui Chen, Ivan Laptev, Xiaodan Liang

    Abstract: While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block…

    Submitted 10 June, 2025; originally announced June 2025.

  21. arXiv:2506.08670  [pdf, ps, other]

    math.NA cs.LG math.OC

    sparseGeoHOPCA: A Geometric Solution to Sparse Higher-Order PCA Without Covariance Estimation

    Authors: Renjie Xu, Chong Wu, Maolin Che, Zhuoheng Ran, Yimin Wei, Hong Yan

    Abstract: We propose sparseGeoHOPCA, a novel framework for sparse higher-order principal component analysis (SHOPCA) that introduces a geometric perspective to high-dimensional tensor decomposition. By unfolding the input tensor along each mode and reformulating the resulting subproblems as structured binary linear optimization problems, our method transforms the original nonconvex sparse objective into a t…

    Submitted 10 June, 2025; originally announced June 2025.

  22. arXiv:2506.07640  [pdf, ps, other]

    math.NT cs.CR math.GR quant-ph

    Stark-Coleman Invariants and Quantum Lower Bounds: An Integrated Framework for Real Quadratic Fields

    Authors: Ruopengyu Xu, Chenglian Liu

    Abstract: Class groups of real quadratic fields represent fundamental structures in algebraic number theory with significant computational implications. While Stark's conjecture establishes theoretical connections between special units and class group structures, explicit constructions have remained elusive, and precise quantum complexity bounds for class group computations are lacking. Here we establish an…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: 16 pages, 1 figure, 3 tables

    MSC Class: 11R29; 11R42; 81P68 ACM Class: F.2.2; G.2.0; E.3

  23. arXiv:2506.07035  [pdf, ps, other]

    q-bio.BM cs.AI

    AnnoDPO: Protein Functional Annotation Learning with Direct Preference Optimization

    Authors: Zixuan Jiang, Renjing Xu

    Abstract: Deciphering protein function remains a fundamental challenge in protein representation learning. The task presents significant difficulties for protein language models (PLMs) due to the sheer volume of functional annotation categories and the highly imbalanced distribution of annotated instances across biological ontologies. Inspired by the remarkable success of reinforcement learning from human f…

    Submitted 8 June, 2025; originally announced June 2025.

  24. arXiv:2506.07020  [pdf, ps, other]

    cs.GR

    CrossGen: Learning and Generating Cross Fields for Quad Meshing

    Authors: Qiujie Dong, Jiepeng Wang, Rui Xu, Cheng Lin, Yuan Liu, Shiqing Xin, Zichun Zhong, Xin Li, Changhe Tu, Taku Komura, Leif Kobbelt, Scott Schaefer, Wenping Wang

    Abstract: Cross fields play a critical role in various geometry processing tasks, especially for quad mesh generation. Existing methods for cross field generation often struggle to balance computational efficiency with generation quality, using slow per-shape optimization. We introduce CrossGen, a novel framework that supports both feed-forward prediction and latent generative modeling of cross fields for q…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: Project page: https://anonymousproject-homepage.github.io/

  25. arXiv:2506.04405  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

    Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi

    Abstract: We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descripti…

    Submitted 4 June, 2025; originally announced June 2025.

  26. arXiv:2506.02929  [pdf, ps, other]

    cs.AR

    Large Processor Chip Model

    Authors: Kaiyan Chang, Mingzhi Chen, Yunji Chen, Zhirong Chen, Dongrui Fan, Junfeng Gong, Nan Guo, Yinhe Han, Qinfen Hao, Shuo Hou, Xuan Huang, Pengwei Jin, Changxin Ke, Cangyuan Li, Guangli Li, Huawei Li, Kuan Li, Naipeng Li, Shengwen Liang, Cheng Liu, Hongwei Liu, Jiahua Liu, Junliang Lv, Jianan Mu, Jin Qin, et al. (18 additional authors not shown)

    Abstract: Computer System Architecture serves as a crucial bridge between software applications and the underlying hardware, encompassing components like compilers, CPUs, coprocessors, and RTL designs. Its development, from early mainframes to modern domain-specific architectures, has been driven by rising computational demands and advancements in semiconductor technology. However, traditional paradigms in…

    Submitted 3 June, 2025; originally announced June 2025.

  27. arXiv:2506.02208  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning

    Authors: Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, Fei Mi

    Abstract: Recent advances in large language model (LLM) post-training have leveraged two distinct paradigms to enhance reasoning capabilities: reinforcement learning (RL) and knowledge distillation (KD). While RL enables the emergence of complex reasoning behaviors, it often suffers from low sample efficiency when the initial policy struggles to explore high-reward trajectories. Conversely, KD improves lear…

    Submitted 2 June, 2025; originally announced June 2025.

  28. arXiv:2506.02007  [pdf, ps, other]

    cs.DC cs.AI cs.NI

    eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems

    Authors: Ruilin Xu, Zongxuan Xie, Pengfei Chen

    Abstract: We present eACGM, a full-stack AI/ML system monitoring framework based on eBPF. eACGM collects real-time performance data from key hardware components, including the GPU and network communication layer, as well as from key software stacks such as CUDA, Python, and PyTorch, all without requiring any code instrumentation or modifications. Additionally, it leverages libnvml to gather process-level GP…

    Submitted 1 July, 2025; v1 submitted 25 May, 2025; originally announced June 2025.

    Comments: IWQoS 2025 (Camera-Ready Version)

  29. arXiv:2506.01943  [pdf, ps, other]

    cs.CV

    Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

    Authors: Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin

    Abstract: Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture multi-object interaction crucial in complex robotic manipulation. This limitation arises from multi-f…

    Submitted 4 July, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: Project Page: https://fuxiao0719.github.io/projects/robomaster/ Code: https://github.com/KwaiVGI/RoboMaster

  30. arXiv:2506.01551  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation

    Authors: Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang

    Abstract: Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corp…

    Submitted 2 June, 2025; originally announced June 2025.

  31. arXiv:2506.01275  [pdf, ps, other]

    cs.AI

    Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D

    Authors: Artemis Panagopoulou, Le Xue, Honglu Zhou, Silvio Savarese, Ran Xu, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles

    Abstract: Real-world decision-making often begins with identifying which modality contains the most relevant information for a given query. While recent multimodal models have made impressive progress in processing diverse inputs, it remains unclear whether they can reason contrastively across multiple modalities to select the one that best satisfies a natural language prompt. We argue this capability is fo…

    Submitted 1 June, 2025; originally announced June 2025.

  32. arXiv:2506.01078  [pdf, ps, other]

    cs.CV cs.AI

    GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking

    Authors: Yufei Zhan, Ziheng Wu, Yousong Zhu, Rongkun Xue, Ruipu Luo, Zhenghao Chen, Can Zhang, Yifan Li, Zhentao He, Zheming Yang, Ming Tang, Minghui Qiu, Jinqiao Wang

    Abstract: Despite notable advancements in multimodal reasoning, leading Multimodal Large Language Models (MLLMs) still underperform on vision-centric multimodal reasoning tasks in general scenarios. This shortfall stems from their predominant reliance on logic- and knowledge-based slow thinking strategies, while effective for domains like math and science, fail to integrate visual information effectively du…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Tech report

  33. arXiv:2506.00320  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents

    Authors: Xiao Yu, Baolin Peng, Ruize Xu, Michel Galley, Hao Cheng, Suman Nath, Jianfeng Gao, Zhou Yu

    Abstract: Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear what behavior is effective and what behavior is missing for long-horizon AI agents tasks. In this work, we pro…

    Submitted 30 May, 2025; originally announced June 2025.

  34. arXiv:2505.24354  [pdf, ps, other]

    cs.CL

    Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research

    Authors: Qianqian Zhang, Jiajia Liao, Heting Ying, Yibo Ma, Haozhan Shen, Jingcheng Li, Peng Liu, Lu Zhang, Chunxin Fang, Kyusong Lee, Ruochen Xu, Tiancheng Zhao

    Abstract: Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks. However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison. We introduce Agent Graph-based Orchestration for R…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted by ACL 2025 Demo

  35. arXiv:2505.24139  [pdf, ps, other]

    cs.CV cs.AI

    S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

    Authors: Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, Dragomir Anguelov, Mingxing Tan

    Abstract: The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches--which directly learn from sensor inputs to generate planning trajectories with…

    Submitted 3 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by CVPR2025; Project website: s4-driver.github.io

  36. arXiv:2505.23764  [pdf, other]

    cs.CV cs.CL

    MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

    Authors: Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

    Abstract: Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent mo…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: 34 pages. A comprehensive, fully human-curated, multi-image-based spatial intelligence benchmark with reasoning annotation for MLLMs. Project page: https://runsenxu.com/projects/MMSI_Bench

  37. arXiv:2505.23214  [pdf, ps, other]

    cs.CV cs.AI

    SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

    Authors: Wenhao Xu, Shuchen Zheng, Changwei Wang, Zherui Zhang, Chuan Ren, Rongtao Xu, Shibiao Xu

    Abstract: Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Information Fusion 2025

  38. arXiv:2505.23123  [pdf, ps, other]

    cs.SI

    Offline Map Matching Based on Localization Error Distribution Modeling

    Authors: Ruilin Xu, Yuchen Song, Kaijie Li, Xitong Gao, Kejiang Ye, Fan Zhang, Juanjuan Zhao

    Abstract: Offline map matching involves aligning historical trajectories of mobile objects, which may have positional errors, with digital maps. This is essential for applications in intelligent transportation systems (ITS), such as route analysis and traffic pattern mining. Existing methods have two main limitations: (i) they assume a uniform Localization Error Distribution (LED) across urban areas, neglec…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: 13 pages

  39. arXiv:2505.23068  [pdf, ps, other]

    cs.CV

    URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration

    Authors: Rui Xu, Yuzhen Niu, Yuezhou Li, Huangbiao Xu, Wenxi Liu, Yuzhong Chen

    Abstract: Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with multi-state perspective, enabling flexible and effective degradation restoratio…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: This paper has been accepted to CVPR 2025

  40. arXiv:2505.20444  [pdf, other]

    cs.LG cs.CV

    HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

    Authors: Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu

    Abstract: Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains a…

    Submitted 26 May, 2025; originally announced May 2025.

  41. arXiv:2505.20231  [pdf, ps, other]

    cs.CL

    Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue

    Authors: Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong

    Abstract: Existing Task-Oriented Dialogue (TOD) systems primarily focus on single-session dialogues, limiting their effectiveness in long-term memory augmentation. To address this challenge, we introduce an MS-TOD dataset, the first multi-session TOD dataset designed to retain long-term memory across sessions, enabling fewer turns and more efficient task completion. This defines a new benchmark task for eval…

    Submitted 26 May, 2025; originally announced May 2025.

  42. arXiv:2505.18608  [pdf, other]

    cs.CV

    Spiking Transformers Need High Frequency Information

    Authors: Yuetong Fang, Deming Zhou, Ziqing Wang, Hongwei Ren, ZeCui Zeng, Lusong Li, Shibo Zhou, Renjing Xu

    Abstract: Spiking Transformers offer an energy-efficient alternative to conventional deep learning by transmitting information solely through binary (0/1) spikes. However, there remains a substantial performance gap compared to artificial neural networks. A common belief is that their binary and sparse activation transmission leads to information loss, thus degrading feature representation and accuracy. In…

    Submitted 24 May, 2025; originally announced May 2025.

  43. arXiv:2505.18542  [pdf, ps, other]

    cs.CL

    Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

    Authors: Chen Yang, Ruping Xu, Ruizhe Li, Bin Cao, Jing Fan

    Abstract: Process mining aims to discover, monitor and optimize the actual behaviors of real processes. While prior work has mainly focused on extracting procedural action flows from instructional texts, rule flows embedded in business documents remain underexplored. To this end, we introduce a novel annotated Chinese dataset, BPRF, which contains 50 business process documents with 326 explicitly labeled bu…

    Submitted 28 May, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

  44. arXiv:2505.18082  [pdf, ps, other]

    cs.LG

    An Iterative Framework for Generative Backmapping of Coarse Grained Proteins

    Authors: Georgios Kementzidis, Erin Wong, John Nicholson, Ruichen Xu, Yuefan Deng

    Abstract: The techniques of data-driven backmapping from coarse-grained (CG) to fine-grained (FG) representation often struggle with accuracy, unstable training, and physical realism, especially when applied to complex systems such as proteins. In this work, we introduce a novel iterative framework by using conditional Variational Autoencoders and graph-based neural networks, specifically designed to tackle…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: 17 pages, 8 figures. For associated code repositories, see CGVAE (https://github.com/wwang2/CoarseGrainingVAE) and GenZProT (https://github.com/learningmatter-mit/GenZProt). See also arXiv:2201.12176 and arXiv:2303.01569 for related methods.

  45. arXiv:2505.18053  [pdf, ps, other]

    cs.CV cs.AI

    FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

    Authors: Zherui Zhang, Jiaxin Wu, Changwei Wang, Rongtao Xu, Longzhao Huang, Wenhao Xu, Wenbo Xu, Li Guo, Shibiao Xu

    Abstract: Prompt learning is a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent popular distillation-based prompt learning methods improve gen…

    Submitted 23 May, 2025; originally announced May 2025.

  46. arXiv:2505.17650  [pdf, ps, other]

    cs.AI

    Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

    Authors: Chengda Lu, Xiaoyu Fan, Yu Huang, Rongwu Xu, Jijie Li, Wei Xu

    Abstract: Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical ana…

    Submitted 23 May, 2025; originally announced May 2025.

  47. arXiv:2505.17015  [pdf, other]

    cs.CV cs.CL

    Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

    Authors: Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang

    Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual corresp…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: 24 pages. An MLLM, dataset, and benchmark for multi-frame spatial understanding. Project page: https://runsenxu.com/projects/Multi-SpatialMLLM

  48. arXiv:2505.16505  [pdf, ps, other]

    cs.CL cs.AI cs.HC

    Sparse Activation Editing for Reliable Instruction Following in Narratives

    Authors: Runcong Zhao, Chengyu Cao, Qinglin Zhu, Xiucheng Lv, Shun Shao, Lin Gui, Ruifeng Xu, Yulan He

    Abstract: Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly…

    Submitted 22 May, 2025; originally announced May 2025.

  49. arXiv:2505.13032  [pdf, other]

    cs.SD cs.CL cs.MM eess.AS

    MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

    Authors: Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, et al. (9 additional authors not shown)

    Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Open-source at https://github.com/ddlBoJack/MMAR

  50. arXiv:2505.09928  [pdf, other]

    cs.CR

    DeFeed: Secure Decentralized Cross-Contract Data Feed in Web 3.0 for Connected Autonomous Vehicles

    Authors: Xingchen Sun, Runhua Xu, Wei Ni, Li Duan, Chao Li

    Abstract: Smart contracts have been a topic of interest in blockchain research and are a key enabling technology for Connected Autonomous Vehicles (CAVs) in the era of Web 3.0. These contracts enable trustless interactions without the need for intermediaries, as they operate based on predefined rules encoded on the blockchain. However, smart contracts face significant challenges in cross-contract communicati…

    Submitted 19 May, 2025; v1 submitted 14 May, 2025; originally announced May 2025.