Showing 1–50 of 2,196 results for author: Li, Q

Searching in archive cs.
  1. arXiv:2507.00993  [pdf, ps, other]

    eess.IV cs.CV

    Advancing Lung Disease Diagnosis in 3D CT Scans

    Authors: Qingqiu Li, Runtian Yuan, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen

    Abstract: To enable more accurate diagnosis of lung disease in chest CT scans, we propose a straightforward yet effective model. Firstly, we analyze the characteristics of 3D CT scans and remove non-lung regions, which helps the model focus on lesion-related areas and reduces computational cost. We adopt ResNeSt50 as a strong feature extractor, and use a weighted cross-entropy loss to mitigate class imbalan…

    Submitted 1 July, 2025; originally announced July 2025.

  2. arXiv:2507.00980  [pdf, ps, other]

    cs.CV

    RTMap: Real-Time Recursive Mapping with Change Detection and Localization

    Authors: Yuheng Du, Sheng Yang, Lingxuan Wang, Zhenghua Hou, Chengying Cai, Zhitao Tan, Mingxia Chen, Shi-Sheng Huang, Qiang Li

    Abstract: While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simul…

    Submitted 1 July, 2025; originally announced July 2025.

  3. arXiv:2507.00330  [pdf, ps, other]

    cs.CL cs.IR

    Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios

    Authors: Mohna Chakraborty, Adithya Kulkarni, Qi Li

    Abstract: Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabiliti…

    Submitted 30 June, 2025; originally announced July 2025.

  4. arXiv:2506.23726  [pdf, ps, other]

    cs.LG cs.AI

    System-Embedded Diffusion Bridge Models

    Authors: Bartlomiej Sobieski, Matthew Tivnan, Yuang Wang, Siyeop Yoon, Pengfei Jin, Dufan Wu, Quanzheng Li, Przemyslaw Biecek

    Abstract: Solving inverse problems -- recovering signals from incomplete or noisy measurements -- is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic pr…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Preprint

  5. arXiv:2506.23590  [pdf, ps, other]

    cs.CV

    CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

    Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin

    Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs' a…

    Submitted 30 June, 2025; originally announced June 2025.

  6. arXiv:2506.23551  [pdf, ps, other]

    cs.LG math.OC

    A unified framework on the universal approximation of transformer-type architectures

    Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen

    Abstract: We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectu…

    Submitted 30 June, 2025; originally announced June 2025.

  7. arXiv:2506.23208  [pdf, ps, other]

    eess.IV cs.CV

    Multi-Source COVID-19 Detection via Variance Risk Extrapolation

    Authors: Runtian Yuan, Qingqiu Li, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen

    Abstract: We present our solution for the Multi-Source COVID-19 Detection Challenge, which aims to classify chest CT scans into COVID and Non-COVID categories across data collected from four distinct hospitals and medical centers. A major challenge in this task lies in the domain shift caused by variations in imaging protocols, scanners, and patient populations across institutions. To enhance the cross-doma…

    Submitted 29 June, 2025; originally announced June 2025.

  8. arXiv:2506.22969  [pdf, ps, other]

    cs.CE

    SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation

    Authors: Qi Li, Kun Li, Haozhi Han, Liang Yuan, Junshi Chen, Yunquan Zhang, Yifeng Chen, Hong An, Ting Cao, Mao Yang

    Abstract: Sparse Tensor Cores offer exceptional performance gains for AI workloads by exploiting structured 2:4 sparsity. However, their potential remains untapped for core scientific workloads such as stencil computations, which exhibit irregular sparsity patterns. This paper presents SparStencil, the first system to retarget sparse TCUs for scientific stencil computations through structured sparsity transf…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: Accepted to SC'25 (June 3, 2025). This work was previously submitted to ISCA'25 (Nov 22, 2024) and substantially revised based on feedback

  9. arXiv:2506.22376  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Probabilistic Optimality for Inference-time Scaling

    Authors: Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei, Qing Li

    Abstract: Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that paralle…

    Submitted 27 June, 2025; originally announced June 2025.

  10. arXiv:2506.22316  [pdf, ps, other]

    cs.CL

    Evaluating Scoring Bias in LLM-as-a-Judge

    Authors: Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

    Abstract: The remarkable performance of Large Language Models (LLMs) gives rise to "LLM-as-a-Judge", where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliabili…

    Submitted 27 June, 2025; originally announced June 2025.

  11. arXiv:2506.21895  [pdf, ps, other]

    cs.CV

    Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning

    Authors: Fangling Jiang, Qi Li, Weining Wang, Gang Wang, Bing Liu, Zhenan Sun

    Abstract: Recently, the emergence of novel presentation attacks has drawn increasing attention to face anti-spoofing. However, existing methods tend to memorize data patterns from the training set, resulting in poor generalization to unknown attack types across different scenarios and limited interpretability. To address these challenges, this paper presents a reinforcement fine-tuning-based face anti-spoofi…

    Submitted 27 June, 2025; originally announced June 2025.

  12. arXiv:2506.21765  [pdf, ps, other]

    eess.IV cs.CV

    TUS-REC2024: A Challenge to Reconstruct 3D Freehand Ultrasound Without External Tracker

    Authors: Qi Li, Shaheer U. Saeed, Yuliang Huang, Mingyuan Luo, Zhongnuo Yan, Jiongquan Chen, Xin Yang, Dong Ni, Nektarios Winter, Phuc Nguyen, Lucas Steinberger, Caelan Haney, Yuan Zhao, Mingjie Jiang, Bowen Ren, SiYeoul Lee, Seonho Kim, MinKyung Seo, MinWoo Kim, Yimeng Dou, Zhiwei Zhang, Yin Li, Tomy Varghese, Dean C. Barratt, Matthew J. Clarkson, et al. (2 additional authors not shown)

    Abstract: Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequence…

    Submitted 26 June, 2025; originally announced June 2025.

  13. arXiv:2506.21618  [pdf, ps, other]

    cs.CL cs.AI

    TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge

    Authors: Zhiyuan Zhang, Xiaosong Jia, Guanyu Chen, Qifeng Li, Junchi Yan

    Abstract: In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and reach a superior performance with…

    Submitted 23 June, 2025; originally announced June 2025.

  14. arXiv:2506.21547  [pdf, ps, other]

    cs.CV cs.RO

    SAM4D: Segment Anything in Camera and LiDAR Streams

    Authors: Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, Qiang Li

    Abstract: We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages eg…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV2025, Project Page: https://SAM4D-Project.github.io

  15. arXiv:2506.19303  [pdf, ps, other]

    cs.RO

    Robotic Perception with a Large Tactile-Vision-Language Model for Physical Property Inference

    Authors: Zexiang Guo, Hengxiang Chen, Xinheng Mai, Qiusang Qiu, Gan Ma, Zhanat Kappassov, Qiang Li, Nutan Chen

    Abstract: Inferring physical properties can significantly enhance robotic manipulation by enabling robots to handle objects safely and efficiently through adaptive grasping strategies. Previous approaches have typically relied on either tactile or visual data, limiting their ability to fully capture properties. We introduce a novel cross-modal perception framework that integrates visual observations with ta…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: This paper has been accepted by the 2025 International Conference on Climbing and Walking Robots (CLAWAR). These authors contributed equally to this work: Zexiang Guo, Hengxiang Chen, Xinheng Mai

  16. arXiv:2506.19296  [pdf, ps, other]

    cs.LG

    The Effect of Depth on the Expressivity of Deep Linear State-Space Models

    Authors: Zeyu Bao, Penghao Yu, Haotian Jiang, Qianxiao Li

    Abstract: Deep state-space models (SSMs) have gained increasing popularity in sequence modelling. While there are numerous theoretical investigations of shallow SSMs, how the depth of the SSM affects its expressiveness remains a crucial problem. In this paper, we systematically investigate the role of depth and width in deep linear SSMs, aiming to characterize how they influence the expressive capacity of t…

    Submitted 24 June, 2025; originally announced June 2025.

  17. arXiv:2506.18939  [pdf, ps, other]

    cs.CV cs.AI

    Damba-ST: Domain-Adaptive Mamba for Efficient Urban Spatio-Temporal Prediction

    Authors: Rui An, Yifeng Zhang, Ziran Liang, Wenqi Fan, Yuxuan Liang, Xuequn Shang, Qing Li

    Abstract: Training urban spatio-temporal foundation models that generalize well across diverse regions and cities is critical for deploying urban services in unseen or data-scarce regions. Recent studies have typically focused on fusing cross-domain spatio-temporal data to train unified Transformer-based models. However, these models suffer from quadratic computational complexity and high memory overhead, l…

    Submitted 22 June, 2025; originally announced June 2025.

  18. arXiv:2506.18658  [pdf, ps, other]

    cs.CV cs.AI

    Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation

    Authors: Ling Zhang, Boxiang Yun, Qingli Li, Yan Wang

    Abstract: Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose a novel Historical Report Guided Bi-modal Concurrent Learning Framework for Pathology Report Generation (BiGen) emulating pathologists' diagnostic reasoni…

    Submitted 23 June, 2025; originally announced June 2025.

  19. arXiv:2506.18645  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Tight Generalization Error Bounds for Stochastic Gradient Descent in Non-convex Learning

    Authors: Wenjun Xiong, Juan Ding, Xinlei Zuo, Qizhai Li

    Abstract: Stochastic Gradient Descent (SGD) is fundamental for training deep neural networks, especially in non-convex settings. Understanding SGD's generalization properties is crucial for ensuring robust model performance on unseen data. In this paper, we analyze the generalization error bounds of SGD for non-convex learning by introducing the Type II perturbed SGD (T2pm-SGD), which accommodates both sub-…

    Submitted 23 June, 2025; originally announced June 2025.

  20. arXiv:2506.18134  [pdf, ps, other]

    cs.CV

    Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp Detection

    Authors: Quan Zhou, Gan Luo, Qiang Hu, Qingyong Zhang, Jinhua Zhang, Yinjiao Tian, Qiang Li, Zhiwei Wang

    Abstract: Polyp detection is crucial for colorectal cancer screening, yet existing models are limited by the scale and diversity of available data. While generative models show promise for data augmentation, current methods mainly focus on enhancing polyp diversity, often overlooking the critical issue of false positives. In this paper, we address this gap by proposing an adversarial diffusion framework to…

    Submitted 22 June, 2025; originally announced June 2025.

    Comments: Early Accepted by MICCAI 2025

  21. arXiv:2506.16401  [pdf, ps, other]

    cs.CY cs.CV

    TrajSceneLLM: A Multimodal Perspective on Semantic GPS Trajectory Analysis

    Authors: Chunhou Ji, Qiumeng Li

    Abstract: GPS trajectory data reveals valuable patterns of human mobility and urban dynamics, supporting a variety of spatial applications. However, traditional methods often struggle to extract deep semantic representations and incorporate contextual map information. We propose TrajSceneLLM, a multimodal perspective for enhancing semantic understanding of GPS trajectories. The framework integrates visualiz…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Under review for ACM SIGSPATIAL 2025

  22. arXiv:2506.15279  [pdf, ps, other]

    cs.CV

    BCRNet: Enhancing Landmark Detection in Laparoscopic Liver Surgery via Bezier Curve Refinement

    Authors: Qian Li, Feng Liu, Shuojue Yang, Daiyun Shen, Yueming Jin

    Abstract: Laparoscopic liver surgery, while minimally invasive, poses significant challenges in accurately identifying critical anatomical structures. Augmented reality (AR) systems, integrating MRI/CT with laparoscopic images based on 2D-3D registration, offer a promising solution for enhancing surgical navigation. A vital aspect of the registration progress is the precise detection of curvilinear anatomic…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: Accepted at MICCAI 2025, 11 pages, 2 figures

  23. arXiv:2506.14753  [pdf, ps, other]

    cs.CV cs.LG

    Cost-Aware Routing for Efficient Text-To-Image Generation

    Authors: Qinchan Li, Kenneth Chen, Changyue Su, Wittawat Jitkrittum, Qi Sun, Patsorn Sangkloy

    Abstract: Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation…

    Submitted 22 June, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

  24. arXiv:2506.13759  [pdf, ps, other]

    cs.LG cs.AI

    Discrete Diffusion in Large Language and Multimodal Models: A Survey

    Authors: Runpeng Yu, Qi Li, Xinchao Wang

    Abstract: In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy. This paradigm naturally enables parallel generation, fine-grained output controllabil…

    Submitted 1 July, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

  25. arXiv:2506.13585  [pdf, ps, other]

    cs.CL cs.LG

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Authors: MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, et al. (103 additional authors not shown)

    Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1

  26. arXiv:2506.13465  [pdf, ps, other]

    cs.CV eess.IV

    SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer

    Authors: Zerui Gong, Zhonghua Wu, Qingyi Tao, Qinyue Li, Chen Change Loy

    Abstract: Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods that prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUT, which preserve structure but lack local adaptabil…

    Submitted 16 June, 2025; originally announced June 2025.

  27. arXiv:2506.12696  [pdf, ps, other]

    cs.LG

    TFKAN: Time-Frequency KAN for Long-Term Time Series Forecasting

    Authors: Xiaoyan Kui, Canwei Liu, Qinsong Li, Zhipeng Hu, Yangyang Shi, Weixin Si, Beiji Zou

    Abstract: Kolmogorov-Arnold Networks (KANs) are highly effective in long-term time series forecasting due to their ability to efficiently represent nonlinear relationships and exhibit local plasticity. However, prior research on KANs has predominantly focused on the time domain, neglecting the potential of the frequency domain. The frequency domain of time series data reveals recurring patterns and periodic…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: 11 pages, 5 figures

  28. arXiv:2506.12314  [pdf, ps, other]

    cs.RO eess.SY

    Explosive Output to Enhance Jumping Ability: A Variable Reduction Ratio Design Paradigm for Humanoid Robots Knee Joint

    Authors: Xiaoshuai Ma, Haoxiang Qi, Qingqing Li, Haochen Xu, Xuechao Chen, Junyao Gao, Zhangguo Yu, Qiang Huang

    Abstract: Enhancing the explosive power output of the knee joints is critical for improving the agility and obstacle-crossing capabilities of humanoid robots. However, a mismatch between the knee-to-center-of-mass (CoM) transmission ratio and jumping demands, coupled with motor performance degradation at high speeds, restricts the duration of high-power output and limits jump performance. To address these p…

    Submitted 13 June, 2025; originally announced June 2025.

  29. arXiv:2506.12186  [pdf, ps, other]

    eess.IV cs.AI cs.CV cs.LG

    MRI-CORE: A Foundation Model for Magnetic Resonance Imaging

    Authors: Haoyu Dong, Yuwen Chen, Hanxue Gu, Nicholas Konz, Yaqian Chen, Qihang Li, Maciej A. Mazurowski

    Abstract: The widespread use of Magnetic Resonance Imaging (MRI) and the rise of deep learning have enabled the development of powerful predictive models for a wide range of diagnostic tasks in MRI, such as image classification or object segmentation. However, training models for specific new tasks often requires large amounts of labeled data, which is difficult to obtain due to high annotation costs and da…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 19 pages, 5 figures

  30. arXiv:2506.12027  [pdf, ps, other]

    cs.CC cs.LG

    Constant Bit-size Transformers Are Turing Complete

    Authors: Qian Li, Yuyi Wang

    Abstract: We prove that any Turing machine running on inputs of arbitrary length can be simulated by a constant bit-size transformer, as long as the context window is sufficiently long. This improves previous works, which require scaling up either the model's precision or the number of parameters on longer inputs. Furthermore, we prove that the complexity class SPACE$[s(n)]$ exactly characterizes the expres…

    Submitted 21 May, 2025; originally announced June 2025.

    Comments: 12 pages

  31. arXiv:2506.11842  [pdf, ps, other]

    cs.RO

    Your Ride, Your Rules: Psychology and Cognition Enabled Automated Driving Systems

    Authors: Zhipeng Bao, Qianwen Li

    Abstract: Despite rapid advances in autonomous driving technology, current autonomous vehicles (AVs) lack effective bidirectional human-machine communication, limiting their ability to personalize the riding experience and recover from uncertain or immobilized states. This limitation undermines occupant comfort and trust, potentially hindering the adoption of AV technologies. We propose PACE-ADS (Psychology…

    Submitted 19 June, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

    Comments: 10 figures, 13 pages, two columns

  32. arXiv:2506.11167  [pdf, ps, other]

    cs.CV cs.LG

    Towards a general-purpose foundation model for fMRI analysis

    Authors: Cheng Wang, Yu Jiang, Zhihao Peng, Chenxin Li, Changbae Bang, Lin Zhao, Jinglei Lv, Jorge Sepulcre, Carl Yang, Lifang He, Tianming Liu, Daniel Barron, Quanzheng Li, Randy Hirschtick, Byung-Hoon Kim, Xiang Li, Yixuan Yuan

    Abstract: Functional Magnetic Resonance Imaging (fMRI) is essential for studying brain function and diagnosing neurological disorders, but current analysis methods face reproducibility and transferability issues due to complex pre-processing and task-specific models. We introduce NeuroSTORM (Neuroimaging Foundation Model with Spatial-Temporal Optimized Representation Modeling), a generalizable framework tha…

    Submitted 11 June, 2025; originally announced June 2025.

  33. arXiv:2506.11073  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention

    Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Libo Qin, Yichong Huang, Baohang Li, Kui Jiang, Yang Xiang, Zhirui Zhang, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address these rely on pretraining or fine-tuning, which are resource-intensiv…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: ACL2025 Main

  34. arXiv:2506.11066  [pdf, ps, other]

    cs.SE cs.AI

    CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval

    Authors: Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu, Walter Pretschner, Heinz Koeppl, Fakhri Karray

    Abstract: Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across fo…

    Submitted 31 May, 2025; originally announced June 2025.

  35. arXiv:2506.10857  [pdf, ps, other]

    cs.CV cs.AI cs.MM

    VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

    Authors: Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang

    Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning st…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Technical Report

  36. arXiv:2506.10508  [pdf, other]

    cs.CL cs.AI

    Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs

    Authors: Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li, Xiao Huang

    Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relat…

    Submitted 12 June, 2025; originally announced June 2025.

  37. arXiv:2506.10264  [pdf, ps, other]

    cs.AI

    WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models

    Authors: Qiyue Yin, Pei Xu, Qiaozhe Li, Shengda Liu, Shengqi Shen, Tong Wang, Yihong Han, Xiaonan Zhao, Likun Yang, Shiyue Cao, Shiyu Qiu, Yuxuan Liu, Shizhao Yu, Lei Cui, Chengxin Yan, Jie Sun, Xiangquan Tang, Kaiqi Huang

    Abstract: Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence's performance on reasoning tasks, particularly demonstrating remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, as a critical component of advanced human cognition, strategic reasoning, i.e., the ability to assess multi-agent behaviors in dynamic envir…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 15 pages, 17 figures

  38. arXiv:2506.09935  [pdf, ps, other]

    cs.CV

    LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation

    Authors: Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang

    Abstract: Developing 3D-VL generalists capable of understanding 3D scenes and following natural language instructions to perform a wide range of tasks has been a long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL models still lag behind their 2D counterparts in capability and robustness, falling short of the generalist standard. A key obstacle to developing 3D-VL generalists lies in d…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Project page: https://leo-vl.github.io

  39. arXiv:2506.09836  [pdf, ps, other]

    cs.CV cs.AI

    DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction

    Authors: Junli Deng, Ping Shi, Qipei Li, Jinyang Guo

    Abstract: Reconstructing intricate, ever-changing environments remains a central ambition in computer vision, yet existing solutions often crumble before the complexity of real-world dynamics. We present DynaSplat, an approach that extends Gaussian Splatting to dynamic scenes by integrating dynamic-static separation and hierarchical motion modeling. First, we classify scene elements as static or dynamic thr…

    Submitted 11 June, 2025; originally announced June 2025.

  40. arXiv:2506.09565  [pdf, ps, other]

    cs.CV

    SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields

    Authors: Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, Yebin Liu

    Abstract: Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry rec…

    Submitted 13 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  41. arXiv:2506.07964  [pdf, ps, other]

    cs.CV cs.AI

    SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design

    Authors: Wenxin Tang, Jingyu Xiao, Wenxuan Jiang, Xi Xiao, Yuhang Wang, Xuxin Tang, Qing Li, Yuehe Ma, Junliang Liu, Shisong Tang, Michael R. Lyu

    Abstract: Manual slide creation is labor-intensive and requires expert prior knowledge. Existing natural language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. To address this, we formalize the Reference Image to Slide Generation task and propose Slide2Code, the first benchmark with difficulty-tiered samples based on a novel Slide Complexity Metric. We…

    Submitted 9 June, 2025; originally announced June 2025.

  42. arXiv:2506.07900  [pdf, ps, other]

    cs.CL cs.AI

    MiniCPM4: Ultra-Efficient LLMs on End Devices

    Authors: MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, et al. (50 additional authors not shown)

    Abstract: This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelera…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: MiniCPM4 Technical Report

  43. arXiv:2506.06823  [pdf, other]

    cs.CV cs.AI

    Exploring Visual Prompting: Robustness Inheritance and Beyond

    Authors: Qi Li, Liangzhi Li, Zhouqiang Jiang, Bowen Wang, Keke Tang

    Abstract: Visual Prompting (VP), an efficient method for transfer learning, has shown its potential in vision tasks. However, previous works focus exclusively on VP from standard source models; it is still unknown how it performs under the scenario of a robust source model: Can the robustness of the source model be successfully inherited? Does VP also encounter the same trade-off between robustness and gene…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2311.10992

  44. arXiv:2506.06294  [pdf, ps, other]

    cs.LG cs.AI cs.CL q-bio.BM

    GLProtein: Global-and-Local Structure Aware Protein Representation Learning

    Authors: Yunqing Liu, Wenqi Fan, Xiaoyong Wei, Qing Li

    Abstract: Proteins are central to biological systems, participating as building blocks across all forms of life. Despite advancements in understanding protein functions through protein sequence analysis, there remains potential for further exploration in integrating protein structural information. We argue that the structural information of proteins is not only limited to their 3D information but also encom…

    Submitted 17 May, 2025; originally announced June 2025.

  45. arXiv:2506.06240  [pdf, ps, other]

    cs.CL

    Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge

    Authors: Yi Sui, Chaozhuo Li, Chen Zhang, Dawei Song, Qiuchi Li

    Abstract: Retrieval-augmented generation (RAG) is a cost-effective approach to mitigate the hallucination of Large Language Models (LLMs) by incorporating the retrieved external knowledge into the generation process. However, external knowledge may conflict with the parametric knowledge of LLMs. Furthermore, current LLMs lack inherent mechanisms for resolving such knowledge conflicts, making traditional RAG…

    Submitted 6 June, 2025; originally announced June 2025.

  46. arXiv:2506.06112  [pdf, ps, other]

    cs.LG cs.AI cs.CR

    Towards Lifecycle Unlearning Commitment Management: Measuring Sample-level Unlearning Completeness

    Authors: Cheng-Long Wang, Qi Li, Zihang Xiang, Yinzhi Cao, Di Wang

    Abstract: Growing concerns over data privacy and security highlight the importance of machine unlearning--removing specific data influences from trained models without full retraining. Techniques like Membership Inference Attacks (MIAs) are widely used to externally assess successful unlearning. However, existing methods face two key limitations: (1) maximizing MIA effectiveness (e.g., via online attacks) r…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: To appear in the Proceedings of USENIX Security Symposium, 2025

  47. arXiv:2506.05779  [pdf, ps, other]

    cs.NI cs.LG

    Pegasus: A Universal Framework for Scalable Deep Learning Inference on the Dataplane

    Authors: Yinchao Zhang, Su Yao, Yong Feng, Kang Chen, Tong Li, Zhuotao Liu, Yi Zhao, Lexuan Zhang, Xiangyu Gao, Feng Xiong, Qi Li, Ke Xu

    Abstract: The paradigm of Intelligent DataPlane (IDP) embeds deep learning (DL) models on the network dataplane to enable intelligent traffic analysis at line-speed. However, the current use of the match-action table (MAT) abstraction on the dataplane is misaligned with DL inference, leading to several key limitations, including accuracy degradation, limited scale, and lack of generality. This paper propose…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: To be published in SIGCOMM 2025

  48. arXiv:2506.05678  [pdf, ps, other]

    cs.LG

    Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions

    Authors: Haotian Jiang, Zeyu Bao, Shida Wang, Qianxiao Li

    Abstract: The evolution of sequence modeling architectures, from recurrent neural networks and convolutional models to Transformers and structured state-space models, reflects ongoing efforts to address the diverse temporal dependencies inherent in sequential data. Despite this progress, systematically characterizing the strengths and limitations of these architectures remains a fundamental challenge. In th…

    Submitted 8 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

  49. arXiv:2506.04897  [pdf, ps, other]

    cs.CV

    From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

    Authors: Tianxu Wang, Zhuofan Zhang, Ziyu Zhu, Yue Fan, Jing Xiong, Pengxiang Li, Xiaojian Ma, Qing Li

    Abstract: 3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, uno…

    Submitted 5 June, 2025; originally announced June 2025.

  50. arXiv:2506.04575  [pdf, ps, other]

    cs.CL

    Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?

    Authors: Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu

    Abstract: Neuro-symbolic approaches combining large language models (LLMs) with solvers excel in logical reasoning problems that need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas. Then reliable symbolic solvers return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lex…

    Submitted 4 June, 2025; originally announced June 2025.