Showing 1–50 of 5,118 results for author: Chen, X

Searching in archive cs.
  1. arXiv:2507.01908  [pdf, ps, other]

    cs.CV

    Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

    Authors: Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang

    Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and…

    Submitted 2 July, 2025; originally announced July 2025.

  2. arXiv:2507.01299  [pdf, ps, other]

    cs.CL

    La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

    Authors: Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, Lulu Hu

    Abstract: Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: ICML 2025 Acceptance

  3. arXiv:2507.01291  [pdf, ps, other]

    eess.IV cs.CV

    PanTS: The Pancreatic Tumor Segmentation Dataset

    Authors: Wenxuan Li, Xinze Zhou, Qi Chen, Tianyu Lin, Pedro R. A. S. Bassi, Szymon Plotka, Jaroslaw B. Cwikla, Xiaoxi Chen, Chen Ye, Zheren Zhu, Kai Ding, Heng Li, Kang Wang, Yang Yang, Yucheng Tang, Daguang Xu, Alan L. Yuille, Zongwei Zhou

    Abstract: PanTS is a large-scale, multi-institutional dataset curated to advance research in pancreatic CT analysis. It contains 36,390 CT scans from 145 medical centers, with expert-validated, voxel-wise annotations of over 993,000 anatomical structures, covering pancreatic tumors, pancreas head, body, and tail, and 24 surrounding anatomical structures such as vascular/skeletal structures and abdominal/tho…

    Submitted 1 July, 2025; originally announced July 2025.

  4. arXiv:2507.01145  [pdf, ps, other]

    cs.AR

    CarbonClarity: Understanding and Addressing Uncertainty in Embodied Carbon for Sustainable Computing

    Authors: Xuesi Chen, Leo Han, Anvita Bhagavathula, Udit Gupta

    Abstract: Embodied carbon footprint modeling has become an area of growing interest due to its significant contribution to carbon emissions in computing. However, the deterministic nature of the existing models fails to account for the spatial and temporal variability in the semiconductor supply chain. The absence of uncertainty modeling limits system designers' ability to make informed, carbon-aware decisio…

    Submitted 1 July, 2025; originally announced July 2025.

  5. arXiv:2507.00041  [pdf, ps, other]

    cs.AI cs.CV cs.IR

    TalentMine: LLM-Based Extraction and Question-Answering from Multimodal Talent Tables

    Authors: Varun Mannam, Fang Wang, Chaochun Liu, Xin Chen

    Abstract: In talent management systems, critical information often resides in complex tabular formats, presenting significant retrieval challenges for conventional language models. These challenges are pronounced when processing Talent documentation that requires precise interpretation of tabular relationships for accurate information retrieval and downstream decision-making. Current table extraction method…

    Submitted 22 June, 2025; originally announced July 2025.

    Comments: Submitted to KDD conference, workshop: Talent and Management Computing (TMC 2025), https://tmcworkshop.github.io/2025/

  6. arXiv:2506.24086  [pdf, ps, other]

    cs.CV cs.CL

    MotionGPT3: Human Motion as a Second Modality

    Authors: Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen

    Abstract: Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and discrete repr…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 21 pages, 8 figures

  7. arXiv:2506.24045  [pdf, ps, other]

    cs.DC cs.LG

    Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

    Authors: Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo

    Abstract: The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these…

    Submitted 30 June, 2025; originally announced June 2025.

  8. arXiv:2506.24009  [pdf, ps, other]

    cs.IT cs.AI

    Bridging Physical and Digital Worlds: Embodied Large AI for Future Wireless Systems

    Authors: Xinquan Wang, Fenghao Zhu, Zhaohui Yang, Chongwen Huang, Xiaoming Chen, Zhaoyang Zhang, Sami Muhaidat, Mérouane Debbah

    Abstract: Large artificial intelligence (AI) models offer revolutionary potential for future wireless systems, promising unprecedented capabilities in network optimization and performance. However, current paradigms largely overlook crucial physical interactions. This oversight means they primarily rely on offline datasets, leading to difficulties in handling real-time wireless dynamics and non-stationary e…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 7 pages, 4 figures

  9. arXiv:2506.23700  [pdf, ps, other]

    eess.IV cs.CV

    MedSAM-CA: A CNN-Augmented ViT with Attention-Enhanced Multi-Scale Fusion for Medical Image Segmentation

    Authors: Peiting Tian, Xi Chen, Haixia Bi, Fan Li

    Abstract: Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning, where accurate boundary delineation is essential for precise lesion localization, organ identification, and quantitative assessment. In recent years, deep learning-based methods have significantly advanced segmentation accuracy. However, two major challenges remain. First, the performance of these methods…

    Submitted 30 June, 2025; originally announced June 2025.

  10. arXiv:2506.23534  [pdf, ps, other]

    cs.SE

    Improving vulnerability type prediction and line-level detection via adversarial training-based data augmentation and multi-task learning

    Authors: Siyu Chen, Jiongyi Yang, Xiang Chen, Menglin Zheng, Minnan Wei, Xiaolin Ju

    Abstract: Context: Software vulnerabilities pose a significant threat to modern software systems, as evidenced by the growing number of reported vulnerabilities and cyberattacks. These escalating trends underscore the urgent need for effective approaches that can automatically detect and understand software vulnerabilities. Objective: However, the scarcity of labeled samples and the class imbalance issue in…

    Submitted 30 June, 2025; originally announced June 2025.

  11. arXiv:2506.23395  [pdf, ps, other]

    cs.DC

    FastSet: Parallel Claim Settlement

    Authors: Xiaohong Chen, Grigore Rosu

    Abstract: FastSet is an actor-based distributed protocol for decentralized finance and settlement, inspired by blockchains. Account holders cooperate by making claims, which can include payments, holding and transferring assets, accessing and updating shared data, medical records, digital identity, and mathematical theorems, among many others. The claims are signed by their owners and are broadca…

    Submitted 29 June, 2025; originally announced June 2025.

  12. arXiv:2506.23361  [pdf, ps, other]

    cs.CV

    OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

    Authors: Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille

    Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, is still less explored. In this paper, we first propose a data cons…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: A data construction pipeline and a diffusion Transformer framework for controllable subject-driven video customization

  13. arXiv:2506.23285  [pdf, ps, other]

    cs.CV

    Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification

    Authors: Daqian Shi, Xiaolei Diao, Xu Chen, Cédric M. John

    Abstract: Deep Neural Networks (DNNs) have significantly advanced the field of computer vision. To improve the DNN training process, knowledge distillation methods demonstrate their effectiveness in accelerating network training by introducing a fixed learning direction from the teacher network to student networks. In this context, several distillation-based optimization strategies are proposed, e.g., deep mutu…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025

  14. arXiv:2506.23235  [pdf, ps, other]

    cs.CL

    Generalist Reward Models: Found Inside Large Language Models

    Authors: Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou

    Abstract: The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token predi…

    Submitted 29 June, 2025; originally announced June 2025.

  15. arXiv:2506.23207  [pdf, ps, other]

    cs.CV

    TVG-SLAM: Robust Gaussian Splatting SLAM with Tri-view Geometric Constraints

    Authors: Zhen Tan, Xieyuanli Chen, Lei Feng, Yangbing Ge, Shuaifeng Zhi, Jiaxiong Liu, Dewen Hu

    Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled RGB-only SLAM systems to achieve high-fidelity scene representation. However, the heavy reliance of existing systems on photometric rendering loss for camera tracking undermines their robustness, especially in unbounded outdoor environments with severe viewpoint and illumination changes. To address these challenges, we propose TVG-SLAM,…

    Submitted 29 June, 2025; originally announced June 2025.

  16. arXiv:2506.23044  [pdf, ps, other]

    cs.CV cs.AI

    Ovis-U1 Technical Report

    Authors: Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen

    Abstract: In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike s…

    Submitted 1 July, 2025; v1 submitted 28 June, 2025; originally announced June 2025.

    Comments: A unified model for multimodal understanding, text-to-image generation, and image editing. GitHub: https://github.com/AIDC-AI/Ovis-U1

  17. arXiv:2506.22954  [pdf, ps, other]

    cs.SI cs.SE

    Evaluating and Improving Large Language Models for Competitive Program Generation

    Authors: Minnan Wei, Ziming Li, Xiang Chen, Menglin Zheng, Ziyan Qu, Cheng Yu, Siyu Chen, Xiaolin Ju

    Abstract: Context: Due to the demand for strong algorithmic reasoning, complex logic implementation, and strict adherence to input/output formats and resource constraints, competitive programming generation by large language models (LLMs) is considered the most challenging problem in current LLM-based code generation. However, previous studies often evaluate LLMs using simple prompts and benchmark datasets…

    Submitted 28 June, 2025; originally announced June 2025.

  18. arXiv:2506.22895  [pdf, ps, other]

    cs.LG cs.AI

    Interpretable Time Series Autoregression for Periodicity Quantification

    Authors: Xinyu Chen, Vassilis Digalakis Jr, Lijun Ding, Dingyi Zhuang, Jinhua Zhao

    Abstract: Time series autoregression is a classical statistical model for capturing auto-correlations and identifying temporal patterns such as periodicity and seasonality. In this work, we propose a novel sparse autoregression framework from an interpretable machine learning perspective and the model interpretability for periodicity quantification is reinforced by $\ell_0$-norm induced sparsity constraints…

    Submitted 28 June, 2025; originally announced June 2025.

  19. arXiv:2506.22890  [pdf, ps, other]

    cs.CV cs.CR

    CP-Guard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems

    Authors: Senkang Hu, Yihang Tao, Guowen Xu, Xinyuan Qian, Yiqin Deng, Xianhao Chen, Sam Tak Wu Kwong, Yuguang Fang

    Abstract: Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from ma…

    Submitted 28 June, 2025; originally announced June 2025.

  20. arXiv:2506.22726  [pdf, ps, other]

    cs.CV cs.LG

    XTransfer: Cross-Modality Model Transfer for Human Sensing with Few Data at the Edge

    Authors: Yu Zhang, Xi Zhang, Hualin Zhou, Xinyuan Chen, Shang Gao, Hong Jia, Jianfei Yang, Yuankai Qi, Tao Gu

    Abstract: Deep learning for human sensing on edge systems offers significant opportunities for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. Current methods that rely on transferring pre-trained models often encounter issues such as modality shift and high resource demands, resulting in substantial…

    Submitted 27 June, 2025; originally announced June 2025.

  21. arXiv:2506.22434  [pdf, ps, other]

    cs.CV

    MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

    Authors: Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao

    Abstract: This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic acros…

    Submitted 27 June, 2025; originally announced June 2025.

  22. arXiv:2506.22058  [pdf, ps, other]

    cs.CL

    Lost at the Beginning of Reasoning

    Authors: Baohao Liao, Xinyi Chen, Sara Rajaee, Yuhui Xu, Christian Herold, Anders Søgaard, Maarten de Rijke, Christof Monz

    Abstract: Recent advancements in large language models (LLMs) have significantly advanced complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored. And recent findings on…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 9 pages, 5 figures, 2 tables

  23. arXiv:2506.22023  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy

    Authors: Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu

    Abstract: Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 17 pages, 8 figures, 5 tables

  24. arXiv:2506.21734  [pdf, ps, other]

    cs.AI cs.LG

    Hierarchical Reasoning Model

    Authors: Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, Yasin Abbasi Yadkori

    Abstract: Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose th…

    Submitted 26 June, 2025; originally announced June 2025.

  25. arXiv:2506.21605  [pdf, ps, other]

    cs.CL cs.AI

    MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

    Authors: Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong

    Abstract: Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities remains a challenge. Previous evaluations are commonly limited by the diversity of memory levels and interactive scenarios. They also lack comprehensive metrics to reflect the m…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: 17 pages, 5 figures. Accepted by ACL 2025 findings

  26. arXiv:2506.21604  [pdf, ps, other]

    cs.IR cs.AI cs.CV cs.HC cs.LG

    Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding

    Authors: Varun Mannam, Fang Wang, Xin Chen

    Abstract: Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intellige…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Conference: KDD conference workshop: https://kdd-eval-workshop.github.io/genai-evaluation-kdd2025/

  27. arXiv:2506.21119  [pdf, ps, other]

    cs.CL cs.AI

    Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models

    Authors: Xiaoshuang Ji, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Zeyao Liu

    Abstract: Fine-tuning is a promising technique for leveraging Transformer-based language models in downstream tasks. As model sizes continue to grow, updating all model parameters becomes increasingly costly. Parameter-efficient fine-tuning methods effectively address this issue by selectively updating a small subset of parameters. However, fine-tuning and most existing parameter-efficient fine-tuning metho…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted by ICONIP 2024

  28. arXiv:2506.21074  [pdf, ps, other]

    eess.AS cs.SD

    CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

    Authors: Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Xie Chen, Kai Yu

    Abstract: Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address th…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: 16 pages, 5 figures, 9 tables

  29. arXiv:2506.20960  [pdf, ps, other]

    cs.CV cs.AI

    OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs

    Authors: Yiman Zhang, Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang, Xinghao Chen, Kai Han

    Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverag…

    Submitted 29 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

  30. arXiv:2506.20381  [pdf, ps, other]

    cs.CV cs.LG

    Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking

    Authors: Ben Kang, Xin Chen, Jie Zhao, Chunjuan Bo, Dong Wang, Huchuan Lu

    Abstract: Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various d…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: This paper was accepted by International Journal of Computer Vision (IJCV)

  31. arXiv:2506.19975  [pdf, ps, other]

    eess.IV cs.AI cs.CV eess.SP

    VoxelOpt: Voxel-Adaptive Message Passing for Discrete Optimization in Deformable Abdominal CT Registration

    Authors: Hang Zhang, Yuxi Zhang, Jiazheng Wang, Xiang Chen, Renjiu Hu, Xin Tian, Gaolei Li, Min Liu

    Abstract: Recent developments in neural networks have improved deformable image registration (DIR) by amortizing iterative optimization, enabling fast and accurate DIR results. However, learning-based methods often face challenges with limited training data, large deformations, and tend to underperform compared to iterative approaches when label supervision is unavailable. While iterative methods can achiev…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Accepted for publication at MICCAI 2025

  32. arXiv:2506.19889  [pdf, ps, other]

    cs.CR cs.AI

    Retrieval-Confused Generation is a Good Defender for Privacy Violation Attack of Large Language Models

    Authors: Wanli Peng, Xin Chen, Hang Fu, XinYu He, Xue Yiming, Juan Wen

    Abstract: Recent advances in large language models (LLMs) have made a profound impact on our society and also raised new security concerns. Particularly, due to the remarkable inference ability of LLMs, the privacy violation attack (PVA), revealed by Staab et al., introduces serious personal privacy issues. Existing defense methods mainly leverage LLMs to anonymize the input query, which requires costly inf…

    Submitted 24 June, 2025; originally announced June 2025.

  33. arXiv:2506.19816  [pdf, ps, other]

    cs.RO cs.CV

    CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation

    Authors: Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, Jiangmiao Pang

    Abstract: Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substan…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: 36 pages, 21 figures

  34. arXiv:2506.19505  [pdf, ps, other]

    cs.CL

    AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models

    Authors: Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, Xiaowen Chu

    Abstract: Quantization has emerged as an effective and lightweight solution to reduce the memory footprint of the KV cache in Large Language Models (LLMs). Nevertheless, minimizing the performance degradation caused by ultra-low-bit KV cache quantization remains a significant challenge. We observe that quantizing the KV cache of different tokens has varying impacts on the quality of attention outputs. To sy…

    Submitted 24 June, 2025; originally announced June 2025.

  35. arXiv:2506.19385  [pdf, ps, other]

    cs.AI

    Conversational Intent-Driven GraphRAG: Enhancing Multi-Turn Dialogue Systems through Adaptive Dual-Retrieval of Flow Patterns and Context Semantics

    Authors: Ziqi Zhu, Tao Hu, Honglong Zhang, Dan Yang, HanGeng Chen, Mengran Zhang, Xilun Chen

    Abstract: We present CID-GraphRAG (Conversational Intent-Driven Graph Retrieval Augmented Generation), a novel framework that addresses the limitations of existing dialogue systems in maintaining both contextual coherence and goal-oriented progression in multi-turn customer service conversations. Unlike traditional RAG systems that rely solely on semantic similarity (Conversation RAG) or standard knowledge…

    Submitted 24 June, 2025; originally announced June 2025.

  36. arXiv:2506.19283  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration

    Authors: Xiangbo Gao, Yuheng Wu, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu

    Abstract: While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative o…

    Submitted 30 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  37. arXiv:2506.18901  [pdf, ps, other]

    cs.CV

    From Virtual Games to Real-World Play

    Authors: Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen, Hongyang Zhang, Jun Zhang, Yan Lu

    Abstract: We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and r…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Project page: https://wenqsun.github.io/RealPlay/

  38. arXiv:2506.18890  [pdf, ps, other]

    cs.CV

    4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

    Authors: Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan

    Abstract: Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimizati…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Project page: https://4dlrm.github.io/

  39. arXiv:2506.18862  [pdf, ps, other]

    cs.CV cs.AI

    TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting

    Authors: Zhongbin Guo, Yuhao Wang, Ping Jian, Xinyue Chen, Wei Peng, Ertai E

    Abstract: Satellite image time-series analysis demands fine-grained spatial-temporal reasoning, which remains a challenge for existing multimodal large language models (MLLMs). In this work, we study the capabilities of MLLMs on a novel task that jointly targets temporal change understanding and future scene generation, aiming to assess their potential for modeling complex multimodal dynamics over time. We…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Submitted to the 33rd ACM International Conference on Multimedia. Our dataset can be found at https://huggingface.co/datasets/IceInPot/TAMMs

  40. arXiv:2506.18678  [pdf, ps, other]

    cs.CV cs.RO

    MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation

    Authors: Tianchen Deng, Guole Shen, Xun Chen, Shenghai Yuan, Hongming Shen, Guohao Peng, Zhenyu Wu, Jingchuan Wang, Lihua Xie, Danwei Wang, Hesheng Wang, Weidong Chen

    Abstract: Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios, and face difficulties in large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed mu…

    Submitted 23 June, 2025; originally announced June 2025.

  41. arXiv:2506.18565  [pdf, ps, other]

    cs.CE

    A Physics-Informed Neural Network Framework for Simulating Creep Buckling in Growing Viscoelastic Biological Tissues

    Authors: Zhongya Lin, Jinshuai Bai, Shuang Li, Xindong Chen, Bo Li, Xi-Qiao Feng

    Abstract: Modeling viscoelastic behavior is crucial in engineering and biomechanics, where materials undergo time-dependent deformations, including stress relaxation, creep buckling and biological tissue development. Traditional numerical methods, like the finite element method, often require explicit meshing, artificial perturbations or embedding customised programs to capture these phenomena, adding compu…

    Submitted 23 June, 2025; originally announced June 2025.

  42. arXiv:2506.18348  [pdf, ps, other]

    cs.AI

    Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team

    Authors: Weilun Yu, Shixiang Tang, Yonggui Huang, Nanqing Dong, Li Fan, Honggang Qi, Wei Liu, Xiaoli Diao, Xi Chen, Wanli Ouyang

    Abstract: Scientific progress increasingly relies on effective collaboration among researchers, a dynamic that large language models (LLMs) have only begun to emulate. While recent LLM-based scientist agents show promise in autonomous scientific discovery, they often lack the interactive reasoning and evaluation mechanisms essential to real-world research. We propose IDVSCI (Internal Discussion and Vote SCI…

    Submitted 27 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  43. arXiv:2506.18023  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding

    Authors: Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu

    Abstract: This report introduces PP-DocBee2, an advanced version of PP-DocBee designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, improved visual feature fusion strategy, and optimized inference methodologies. These…

    Submitted 24 June, 2025; v1 submitted 22 June, 2025; originally announced June 2025.

  44. arXiv:2506.17667  [pdf, ps, other]

    cs.AI

    PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models

    Authors: Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Peng Xia, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Mingyu Ding, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, Xinzhu Ma

    Abstract: Physics problem-solving is a challenging domain for large AI models, requiring integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Current evaluation methodologies show notable limitations in capturing the breadth and complexity of undergraduate-level physics, underscoring the need for more rigorous assessments. To this end, we present PhysUniB…

    Submitted 27 June, 2025; v1 submitted 21 June, 2025; originally announced June 2025.

  45. arXiv:2506.17642  [pdf, ps, other]

    cs.SE

    May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs

    Authors: Shaoyu Yang, Chunrong Fang, Haifeng Lin, Xiang Chen, Zhenyu Chen

    Abstract: Artificial Intelligence (AI) infrastructures, represented by Deep Learning (DL) frameworks, have served as fundamental DL systems over the last decade. However, bugs in DL frameworks could lead to catastrophic consequences in some critical scenarios (e.g., healthcare and autonomous driving). A simple yet effective way to find bugs in DL frameworks is fuzz testing (fuzzing). Unfortunately, exis…

    Submitted 21 June, 2025; originally announced June 2025.

  46. arXiv:2506.17638  [pdf, ps, other]

    cs.SE

    Deep Learning Framework Testing via Model Mutation: How Far Are We?

    Authors: Yanzhou Mu, Rong Wang, Juan Zhai, Chunrong Fang, Xiang Chen, Zhiyuan Peng, Peiran Yang, Ruixiang Qian, Shaoyu Yang, Zhenyu Chen

    Abstract: Deep Learning (DL) frameworks are a fundamental component of DL development. Therefore, the detection of DL framework defects is important and challenging. As one of the most widely adopted DL testing techniques, model mutation has recently gained significant attention. In this study, we revisit the defect detection ability of existing mutation-based testing methods and investigate the factors tha…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 27 pages, 9 figures

  47. arXiv:2506.17188  [pdf, ps, other]

    cs.CL cs.AI cs.IR

    Towards AI Search Paradigm

    Authors: Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen, Jun Yang, Haojie Zhang, Jiayi Li, Jiayi Wu, Yiqun Chen, Changle Qu, Keyi Kong, Wenwen Ye, Lixin Su, Xinyu Ma, Long Xia, Daiting Shi, Jiashu Zhao, Haoyi Xiong, Shuaiqiang Wang, Dawei Yin

    Abstract: In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex m…

    Submitted 20 June, 2025; originally announced June 2025.

  48. arXiv:2506.16796  [pdf, ps, other]

    cs.CV

    RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

    Authors: Junbo Qiao, Miaomiao Cai, Wei Li, Yutong Liu, Xudong Huang, Gaoqi He, Jiao Xie, Jie Hu, Xinghao Chen, Shaohui Lin

    Abstract: Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success o…

    Submitted 23 June, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

  49. arXiv:2506.16730  [pdf, ps, other]

    cs.CV

    TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion

    Authors: Mingrui Zhu, Xiru Chen, Xin Wei, Nannan Wang, Xinbo Gao

    Abstract: Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities, producing more informative and comprehensive outputs. Recently, text-guided IVF has shown great potential due to its flexibility and versatility. However, the effective integration and utilization of textual semantic information remains insufficiently studied. To tackle these challenges, w…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: 11 pages, 6 figures

  50. arXiv:2506.16394  [pdf, ps, other]

    stat.ML cs.LG

    Identifying Heterogeneity in Distributed Learning

    Authors: Zelin Xiao, Jia Gu, Song Xi Chen

    Abstract: We study methods for identifying heterogeneous parameter components in distributed M-estimation with minimal data transmission. One is based on a re-normalized Wald test, which is shown to be consistent as long as the number of distributed data blocks $K$ is of a smaller order of the minimum block sample size and the level of heterogeneity is dense. The second one is an extreme contrast test (ECT)…

    Submitted 24 June, 2025; v1 submitted 19 June, 2025; originally announced June 2025.