Showing 1–50 of 1,133 results for author: Sun, Z

Searching in archive cs.
  1. arXiv:2507.05056 [pdf, ps, other]

    cs.CV cs.AI

    INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

    Authors: Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng

    Abstract: Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data sam…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2507.04877 [pdf, ps, other]

    cs.AI

    DoPI: Doctor-like Proactive Interrogation LLM for Traditional Chinese Medicine

    Authors: Zewen Sun, Ruoxiang Huang, Jiahe Feng, Rundong Kong, Yuqian Wang, Hengyu Liu, Ziqi Gong, Yuyuan Qin, Yingxue Wang, Yu Wang

    Abstract: Enhancing interrogation capabilities in Traditional Chinese Medicine (TCM) diagnosis through multi-turn dialogues and knowledge graphs presents a significant challenge for modern AI systems. Current large language models (LLMs), despite their advancements, exhibit notable limitations in medical applications, particularly in conducting effective multi-turn dialogues and proactive questioning. These…

    Submitted 7 July, 2025; originally announced July 2025.

  3. arXiv:2507.03779 [pdf, ps, other]

    cs.CV cs.AI cs.LG

    FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

    Authors: Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero

    Abstract: Large-scale vision foundation models such as DINOv2 boast impressive performance by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning--which is currently extremely demanding computation-wise. We thus propose a novel pre-trainin…

    Submitted 4 July, 2025; originally announced July 2025.

  4. arXiv:2507.02503 [pdf, ps, other]

    cs.LG cs.AI cs.CE

    Continual Gradient Low-Rank Projection Fine-Tuning for LLMs

    Authors: Chenxu Wang, Yilin Lyu, Zicheng Sun, Liping Jing

    Abstract: Continual fine-tuning of Large Language Models (LLMs) is hampered by the trade-off between efficiency and expressiveness. Low-Rank Adaptation (LoRA) offers efficiency but constrains the model's ability to learn new tasks and transfer knowledge due to its low-rank nature and reliance on explicit parameter constraints. We propose GORP (Gradient LOw Rank Projection) for Continual Learning, a novel tr…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 15 pages, 6 figures, accepted by ACL 2025 main

  5. arXiv:2507.02014 [pdf, ps, other]

    cs.IR cs.AI cs.LG stat.ML

    ManifoldMind: Dynamic Hyperbolic Reasoning for Trustworthy Recommendations

    Authors: Anoushka Harit, Zhongtian Sun, Suncica Hadzidedic

    Abstract: We introduce ManifoldMind, a probabilistic geometric recommender system for exploratory reasoning over semantic hierarchies in hyperbolic space. Unlike prior methods with fixed curvature and rigid embeddings, ManifoldMind represents users, items, and tags as adaptive-curvature probabilistic spheres, enabling personalised uncertainty modeling and geometry-aware semantic exploration. A curvature-awa…

    Submitted 2 July, 2025; originally announced July 2025.

  6. arXiv:2507.01599 [pdf, ps, other]

    cs.DB cs.AI cs.CL cs.LG

    Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems

    Authors: Zhaoyan Sun, Jiayi Wang, Xinyang Zhao, Jiachi Wang, Guoliang Li

    Abstract: Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For instance, while there are numerous data science tools available, developing a pipeline planning system to coordinate these tools remains challenging. This difficul…

    Submitted 2 July, 2025; originally announced July 2025.

  7. arXiv:2507.00938 [pdf, ps, other]

    cs.IR cs.AI cs.DB

    WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

    Authors: Zihao Sun, Meng Fang, Ling Chen

    Abstract: Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-in…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 10 pages, 9 figures, 4 tables

    ACM Class: F.2.2; I.2.7

  8. arXiv:2506.21895 [pdf, ps, other]

    cs.CV

    Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning

    Authors: Fangling Jiang, Qi Li, Weining Wang, Gang Wang, Bing Liu, Zhenan Sun

    Abstract: Recently the emergence of novel presentation attacks has drawn increasing attention to face anti-spoofing. However, existing methods tend to memorize data patterns from the training set, resulting in poor generalization to unknown attack types across different scenarios and limited interpretability. To address these challenges, this paper presents a reinforcement fine-tuning-based face anti-spoofi…

    Submitted 27 June, 2025; originally announced June 2025.

  9. arXiv:2506.20923 [pdf, ps, other]

    cs.CL

    KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

    Authors: Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

    Abstract: In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with si…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Technical Report; 26 pages, 12 tables, 1 figure. arXiv admin note: substantial text overlap with arXiv:2501.01028

  10. arXiv:2506.19363 [pdf, ps, other]

    eess.IV cs.CV

    Reconsidering Explicit Longitudinal Mammography Alignment for Enhanced Breast Cancer Risk Prediction

    Authors: Solveig Thrun, Stine Hansen, Zijun Sun, Nele Blum, Suaiba A. Salahuddin, Kristoffer Wickstrøm, Elisabeth Wetzer, Robert Jenssen, Maik Stille, Michael Kampffmeyer

    Abstract: Regular mammography screening is essential for early breast cancer detection. Deep learning-based risk prediction methods have sparked interest to adjust screening intervals for high-risk groups. While early methods focused only on current mammograms, recent approaches leverage the temporal aspect of screenings to track breast tissue changes over time, requiring spatial alignment across different…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: MICCAI 2025, early accepted

  11. arXiv:2506.18410 [pdf, ps, other]

    cs.RO

    Integrating Maneuverable Planning and Adaptive Control for Robot Cart-Pushing under Disturbances

    Authors: Zhe Zhang, Peijia Xie, Zhirui Sun, Bingyi Xia, Bi-Ke Zhu, Jiankun Wang

    Abstract: Precise and flexible cart-pushing is a challenging task for mobile robots. The motion constraints during cart-pushing and the robot's redundancy lead to complex motion planning problems, while variable payloads and disturbances present complicated dynamics. In this work, we propose a novel planning and control framework for flexible whole-body coordination and robust adaptive control. Our motion p…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 11 pages, 11 figures

  12. arXiv:2506.17840 [pdf, ps, other]

    cs.LG cs.AI

    Causal Spherical Hypergraph Networks for Modelling Social Uncertainty

    Authors: Anoushka Harit, Zhongtian Sun

    Abstract: Human social behaviour is governed by complex interactions shaped by uncertainty, causality, and group dynamics. We propose Causal Spherical Hypergraph Networks (Causal-SphHN), a principled framework for socially grounded prediction that jointly models higher-order structure, directional influence, and epistemic uncertainty. Our method represents individuals as hyperspherical embeddings and group…

    Submitted 21 June, 2025; originally announced June 2025.

  13. arXiv:2506.17826 [pdf, ps, other]

    cs.LG cs.AI

    Actionable Interpretability via Causal Hypergraphs: Unravelling Batch Size Effects in Deep Learning

    Authors: Zhongtian Sun, Anoushka Harit, Pietro Lio

    Abstract: While the impact of batch size on generalisation is well studied in vision tasks, its causal mechanisms remain underexplored in graph and text domains. We introduce a hypergraph-based causal framework, HGCNet, that leverages deep structural causal models (DSCMs) to uncover how batch size influences generalisation via gradient noise, minima sharpness, and model complexity. Unlike prior approaches b…

    Submitted 21 June, 2025; originally announced June 2025.

  14. arXiv:2506.17412 [pdf, ps, other]

    eess.IV cs.CV

    VMRA-MaR: An Asymmetry-Aware Temporal Framework for Longitudinal Breast Cancer Risk Prediction

    Authors: Zijun Sun, Solveig Thrun, Michael Kampffmeyer

    Abstract: Breast cancer remains a leading cause of mortality worldwide and is typically detected via screening programs where healthy people are invited at regular intervals. Automated risk prediction approaches have the potential to improve this process by facilitating dynamic screening of high-risk groups. While most models focus solely on the most recent screening, there is growing interest in exploi…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: MICCAI 2025, Provisional Accept

  15. arXiv:2506.16233 [pdf, ps, other]

    astro-ph.GA cs.LG

    Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation

    Authors: Chenrui Ma, Zechang Sun, Tao Jing, Zheng Cai, Yuan-Sen Ting, Song Huang, Mingyu Li

    Abstract: Observational astronomy relies on visual feature identification to detect critical astrophysical phenomena. While machine learning (ML) increasingly automates this process, models often struggle with generalization in large-scale surveys due to the limited representativeness of labeled datasets -- whether from simulations or human annotation -- a challenge pronounced for rare yet scientifically va…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: We have submitted to AAS journals. See another independent work for further reference -- Category-based Galaxy Image Generation via Diffusion Models (Fan, Tang et al.). Comments are welcome

  16. arXiv:2506.15733 [pdf, ps, other]

    cs.AI cs.CL cs.LG

    SPECS: Faster Test-Time Scaling through Speculative Drafts

    Authors: Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer, Ion Stoica, Kannan Ramchandran, Ahmad Beirami, Ziteng Sun

    Abstract: Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total…

    Submitted 15 June, 2025; originally announced June 2025.

    Comments: 28 pages, 6 figures, 2 tables

  17. arXiv:2506.14305 [pdf, ps, other]

    cs.RO

    Socially Aware Robot Crowd Navigation via Online Uncertainty-Driven Risk Adaptation

    Authors: Zhirui Sun, Xingrong Diao, Yao Wang, Bi-Ke Zhu, Jiankun Wang

    Abstract: Navigation in human-robot shared crowded environments remains challenging, as robots are expected to move efficiently while respecting human motion conventions. However, many existing approaches emphasize safety or efficiency while overlooking social awareness. This article proposes Learning-Risk Model Predictive Control (LR-MPC), a data-driven navigation algorithm that balances efficiency, safety…

    Submitted 17 June, 2025; originally announced June 2025.

  18. arXiv:2506.13585 [pdf, ps, other]

    cs.CL cs.LG

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Authors: MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, et al. (103 additional authors not shown)

    Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1

  19. arXiv:2506.12723 [pdf, ps, other]

    cs.CV cs.AI

    SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration

    Authors: Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, Wenwu Zhu

    Abstract: Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models oper…

    Submitted 19 June, 2025; v1 submitted 15 June, 2025; originally announced June 2025.

  20. arXiv:2506.12537 [pdf, ps, other]

    cs.CL cs.AI eess.AS

    Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

    Authors: Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

    Abstract: Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We…

    Submitted 14 June, 2025; originally announced June 2025.

  21. arXiv:2506.10503 [pdf, ps, other]

    cs.CV cs.AI

    Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation

    Authors: Shuyang Li, Shuang Wang, Zhuangzhuang Sun, Jing Xiao

    Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions, which has attracted widespread attention and research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads but face challenges like dense annotation requirements and complex scene interpretation. To…

    Submitted 12 June, 2025; originally announced June 2025.

  22. LightKG: Efficient Knowledge-Aware Recommendations with Simplified GNN Architecture

    Authors: Yanhui Li, Dongxia Wang, Zhu Sun, Haonan Zhang, Huizhong Guo

    Abstract: Recently, Graph Neural Networks (GNNs) have become the dominant approach for Knowledge Graph-aware Recommender Systems (KGRSs) due to their proven effectiveness. Building upon GNN-based KGRSs, Self-Supervised Learning (SSL) has been incorporated to address the sparsity issue, leading to longer training time. However, through extensive experiments, we reveal that: (1) compared to other KGRSs, the exi…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining

  23. arXiv:2506.10329 [pdf, ps, other]

    cs.IR

    Context-Adaptive Graph Neural Networks for Next POI Recommendation

    Authors: Yu Lei, Limin Shen, Zhu Sun, Tiantian He, Yew-Soon Ong

    Abstract: Next Point-of-Interest (POI) recommendation is a critical task in location-based services, aiming to predict users' next visits based on their check-in histories. While many existing methods leverage Graph Neural Networks (GNNs) to incorporate collaborative information and improve recommendation accuracy, most of them model each type of context using separate graphs, treating different factors in…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 12 pages, 6 figures

  24. arXiv:2506.07851 [pdf, ps, other]

    cs.CL

    Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning

    Authors: Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, Yankai Lin

    Abstract: Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still falls behind the pace. Specifically, our preliminary experiments reveal that certain distracting patterns can misdirect the model's attention during inference, and removing these patter…

    Submitted 9 June, 2025; originally announced June 2025.

  25. arXiv:2506.07214 [pdf, other]

    cs.CV cs.CR

    Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation

    Authors: Zhiyuan Zhong, Zhen Sun, Yepang Liu, Xinlei He, Guanhong Tao

    Abstract: Vision Language Models (VLMs) have shown remarkable performance, but are also vulnerable to backdoor attacks whereby the adversary can manipulate the model's outputs through hidden triggers. Prior attacks primarily rely on single-modality triggers, leaving the crucial cross-modal fusion nature of VLMs largely unexplored. Unlike prior work, we identify a novel attack surface that leverages cross-mo…

    Submitted 8 June, 2025; originally announced June 2025.

  26. arXiv:2506.03635 [pdf, ps, other]

    cs.CV

    FingerVeinSyn-5M: A Million-Scale Dataset and Benchmark for Finger Vein Recognition

    Authors: Yinfan Wang, Jie Gui, Baosheng Yu, Qi Li, Zhenan Sun, Juho Kannala, Guoying Zhao

    Abstract: A major challenge in finger vein recognition is the lack of large-scale public datasets. Existing datasets contain few identities and limited samples per finger, restricting the advancement of deep learning-based methods. To address this, we introduce FVeinSyn, a synthetic generator capable of producing diverse finger vein patterns with rich intra-class variations. Using FVeinSyn, we created Finge…

    Submitted 4 June, 2025; originally announced June 2025.

  27. arXiv:2506.02791 [pdf, ps, other]

    cs.SE cs.AI

    Rethinking the effects of data contamination in Code Intelligence

    Authors: Zhen Yang, Hongyi Lin, Yifan He, Jie Xu, Zeyu Sun, Shuo Liu, Pengpeng Wang, Zhongxing Yu, Qingyuan Liang

    Abstract: In recent years, code intelligence has gained increasing importance in the field of automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns regarding data contamination and its potential impact on model performance evaluation. This paper presents a systematic empirical study to investigate the fine…

    Submitted 8 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  28. arXiv:2506.02711 [pdf, other]

    cs.CR

    Privacy Leaks by Adversaries: Adversarial Iterations for Membership Inference Attack

    Authors: Jing Xue, Zhishen Sun, Haishan Ye, Luo Luo, Xiangyu Chang, Ivor Tsang, Guang Dai

    Abstract: Membership inference attack (MIA) has become one of the most widely used and effective methods for evaluating the privacy risks of machine learning models. These attacks aim to determine whether a specific sample is part of the model's training set by analyzing the model's output. While traditional membership inference attacks focus on leveraging the model's posterior output, such as confidence on…

    Submitted 3 June, 2025; originally announced June 2025.

  29. arXiv:2506.02461 [pdf, ps, other]

    cs.CL

    XToM: Exploring the Multilingual Theory of Mind for Large Language Models

    Authors: Chunkit Chan, Yauwai Yim, Hongchuan Zeng, Zhiying Zou, Xinyuan Cheng, Zhifan Sun, Zheye Deng, Kawai Chung, Yuzhuo Ao, Yixiang Fan, Cheng Jiayang, Ercong Nie, Ginny Y. Wong, Helmut Schmid, Hinrich Schütze, Simon See, Yangqiu Song

    Abstract: Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse lin…

    Submitted 3 June, 2025; originally announced June 2025.

  30. arXiv:2506.00942 [pdf, ps, other]

    cs.CL cs.AI eess.SP

    anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding

    Authors: Haitao Li, Ziyu Li, Yiheng Mao, Ziyi Liu, Zhoujian Sun, Zhengxing Huang

    Abstract: The advent of multimodal large language models (MLLMs) has sparked interest in their application to electrocardiogram (ECG) analysis. However, existing ECG-focused MLLMs primarily focus on report generation tasks, often limited to single 12-lead, short-duration (10s) ECG inputs, thereby underutilizing the potential of MLLMs. To this end, we aim to develop an MLLM for ECG analysis that supports a br…

    Submitted 1 June, 2025; originally announced June 2025.

  31. arXiv:2505.23843 [pdf, other]

    cs.CL cs.LG

    Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks

    Authors: Wenhan Dong, Tianyi Hu, Jingyi Zheng, Zhen Sun, Yuemeng Zhao, Yule Liu, Xinlei He, Xinyi Huang

    Abstract: Multi-round incomplete information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Currently, research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals novel insights into the limitations of existing methods, as they often yield misleading results that fail to uncover k…

    Submitted 28 May, 2025; originally announced May 2025.

  32. arXiv:2505.23265 [pdf, ps, other]

    cs.CV

    Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

    Authors: Zheng Sun, Yi Wei, Long Yu

    Abstract: Multimodal Large Language Models (MLLMs) are widely applied across many domains, such as multimodal understanding and generation. With the development of diffusion models (DM) and unified MLLMs, the performance of image generation has been significantly improved; however, the study of image screening is rare and its performance with MLLMs is unsatisfactory due to the lack of data and the wee…

    Submitted 29 May, 2025; originally announced May 2025.

  33. arXiv:2505.23207 [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM

    Authors: Zhaokai Sun, Li Zhang, Qing Wang, Pan Zhou, Lei Xie

    Abstract: Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation, a critical challenge in multi-party speech processing. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks such as voice activity detection (VAD) and overlap detection. To improve acoustic repr…

    Submitted 29 May, 2025; originally announced May 2025.

  34. arXiv:2505.22988 [pdf, ps, other]

    cs.LG cs.AI

    Model-Preserving Adaptive Rounding

    Authors: Albert Tseng, Zhaofeng Sun, Christopher De Sa

    Abstract: The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessa…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Preprint

  35. arXiv:2505.21285 [pdf, ps, other]

    cs.LG stat.ML

    Learnable Kernel Density Estimation for Graphs

    Authors: Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan

    Abstract: This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Under Review

    ACM Class: I.2; I.5.1; I.5.2

  36. arXiv:2505.21165 [pdf, ps, other]

    cs.IR

    Counterfactual Multi-player Bandits for Explainable Recommendation Diversification

    Authors: Yansen Zhang, Bowei He, Xiaokun Zhang, Haolun Wu, Zexu Sun, Chen Ma

    Abstract: Existing recommender systems tend to prioritize items closely aligned with users' historical interactions, inevitably trapping users in the dilemma of "filter bubble". Recent efforts are dedicated to improving the diversity of recommendations. However, they mainly suffer from two major issues: 1) a lack of explainability, making it difficult for the system designers to understand how diverse rec…

    Submitted 11 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted in ECML PKDD 2025

    Journal ref: ECML PKDD 2025

  37. arXiv:2505.20664 [pdf, ps, other]

    cs.CL cs.AI

    Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning

    Authors: Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu

    Abstract: While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To add…

    Submitted 26 May, 2025; originally announced May 2025.

  38. arXiv:2505.20316 [pdf, ps, other]

    cs.AI

    Reinforcement Speculative Decoding for Fast Ranking

    Authors: Yingpeng Du, Tianjun Wei, Zhu Sun, Jie Zhang

    Abstract: Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). To alleviate the latency of auto-regressive decoding, some studies explore the single (first) token decoding for ranking approximation, but they suffer from severe degradation in tail positions. Although speculative decoding (SD) methods can be a remedy…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: 21 pages, 5 figures, 5 tables

    ACM Class: H.4.0

  39. arXiv:2505.19381 [pdf, ps, other]

    cs.AI cs.CV cs.RO

    DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving

    Authors: Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Yunda Dong, Zongzheng Zhang, Xianda Guo, Hao Sun, Hao Zhao

    Abstract: Research interest in end-to-end autonomous driving has surged owing to its fully differentiable design integrating modular tasks, i.e. perception, prediction and planning, which enables optimization in pursuit of the ultimate goal. Despite the great potential of the end-to-end paradigm, existing methods suffer from several aspects including expensive BEV (bird's eye view) computation, action divers…

    Submitted 2 June, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

    Comments: 4 pages

  40. arXiv:2505.18458 [pdf, ps, other]

    cs.DB cs.AI cs.CL cs.IR cs.LG

    A Survey of LLM × DATA

    Authors: Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, Fan Wu

    Abstract: The integration of large language model (LLM) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with high quality, diversity, and timeliness of data required for stages like pre-training, post-training, retrieval-a…

    Submitted 1 June, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: Please refer to the paper list at: https://github.com/weAIDB/awesome-data-llm

  41. arXiv:2505.17568 [pdf, ps, other]

    cs.CR cs.AI cs.SD eess.AS

    JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

    Authors: Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Zeren Luo, Jingyi Zheng, Wenhan Dong, Xinlei He, Xuechao Wang, Yingjie Xue, Shengmin Xu, Xinyi Huang

    Abstract: Audio Language Models (ALMs) have made significant progress recently. These models integrate the audio modality directly into the model, rather than converting speech into text and inputting text to Large Language Models (LLMs). While jailbreak attacks on LLMs have been extensively studied, the security of ALMs with audio modalities remains largely unexplored. Currently, there is a lack of an adve…

    Submitted 23 May, 2025; originally announced May 2025.

  42. arXiv:2505.16532 [pdf, ps, other]

    cs.IR

    Causal-Invariant Cross-Domain Out-of-Distribution Recommendation

    Authors: Jiajie Zhu, Yan Wang, Feng Zhu, Pengfei Ding, Hongyang Liu, Zhu Sun

    Abstract: Cross-Domain Recommendation (CDR) aims to leverage knowledge from a relatively data-richer source domain to address the data sparsity problem in a relatively data-sparser target domain. While CDR methods need to address the distribution shifts between different domains, i.e., cross-domain distribution shifts (CDDS), they typically assume independent and identical distribution (IID) between trainin…

    Submitted 22 May, 2025; originally announced May 2025.

  43. arXiv:2505.16522  [pdf, ps, other]

    cs.CL cs.AI

    Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing

    Authors: Zhouhao Sun, Zhiyuan Kan, Xiao Ding, Li Du, Yang Zhao, Bing Qin, Ting Liu

    Abstract: Despite significant progress, recent studies have indicated that current large language models (LLMs) may still rely on biases during inference, leading to poor generalizability. Some benchmarks have been proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of b…

    Submitted 26 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  44. arXiv:2505.15644  [pdf, ps, other]

    cs.CV cs.AI cs.CR

    FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models

    Authors: Zhen Sun, Ziyi Zhang, Zeren Luo, Zeyang Sha, Tianshuo Cong, Zheng Li, Shiwen Cui, Weiqiang Wang, Jiaheng Wei, Xinlei He, Qi Li, Qian Wang

    Abstract: Fine-grained edited image detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) Binary classifiers yield only a global real-or-fake label without providing localization; (2) Traditional computer vision m…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 14 pages, 15 figures

  45. arXiv:2505.15621  [pdf, ps, other]

    cs.SE

    DSCodeBench: A Realistic Benchmark for Data Science Code Generation

    Authors: Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, Jie M. Zhang

    Abstract: We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. DSCodeBench consists of 1,000 carefully constructed problems sourced from realistic GitHub problems across ten widely used Python data science libraries. Compared to the current state-of-the-art benchmark DS-1000, DSCodeBench offers a mor…

    Submitted 2 July, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  46. arXiv:2505.15276  [pdf, ps, other]

    cs.AI cs.CL

    When Can Large Reasoning Models Save Thinking? Mechanistic Analysis of Behavioral Divergence in Reasoning

    Authors: Rongzhi Zhu, Yi Liu, Zequn Sun, Yiwei Wang, Wei Hu

    Abstract: Large reasoning models (LRMs) have significantly advanced performance on complex tasks, yet their tendency to overthink introduces inefficiencies. This study investigates the internal mechanisms of reinforcement learning (RL)-trained LRMs when prompted to save thinking, revealing three distinct thinking modes: no thinking (NT), explicit thinking (ET), and implicit thinking (IT). Through comprehens…

    Submitted 21 May, 2025; originally announced May 2025.

  47. arXiv:2505.12983  [pdf, other]

    cs.CL cs.AI

    An Empirical Study of Many-to-Many Summarization with Large Language Models

    Authors: Jiaan Wang, Fandong Meng, Zengkui Sun, Yunlong Liang, Yuxuan Cao, Jiarong Xu, Haoxiang Shi, Jie Zhou

    Abstract: Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries also in any language. Recently, large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study on LLMs' M2MS ability. Specifically, we first reorganize M2MS data…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL 2025 main conference

  48. arXiv:2505.12886  [pdf, ps, other]

    cs.AI cs.CL cs.CY

    Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective

    Authors: Zhongxiang Sun, Qipeng Wang, Haoyu Wang, Xiao Zhang, Jun Xu

    Abstract: Large Reasoning Models (LRMs) have shown impressive capabilities in multi-step reasoning tasks. However, alongside these successes, a more deceptive form of model error has emerged--Reasoning Hallucination--where logically coherent but factually incorrect reasoning traces lead to persuasive yet faulty conclusions. Unlike traditional hallucinations, these errors are embedded within structured reaso…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: 25 pages

  49. arXiv:2505.12821  [pdf, other]

    cs.CL cs.AI

    SynDec: A Synthesize-then-Decode Approach for Arbitrary Textual Style Transfer via Large Language Models

    Authors: Han Sun, Zhen Sun, Zongmin Zhang, Linzhao Jia, Wei Shao, Min Zhang

    Abstract: Large Language Models (LLMs) are emerging as dominant forces for textual style transfer. However, for arbitrary style transfer, LLMs face two key challenges: (1) considerable reliance on manually-constructed prompts and (2) rigid stylistic biases inherent in LLMs. In this paper, we propose a novel Synthesize-then-Decode (SynDec) approach, which automatically synthesizes high-quality prompts and am…

    Submitted 19 May, 2025; originally announced May 2025.

  50. arXiv:2505.12635  [pdf, ps, other]

    cs.CV

    MVPainter: Accurate and Detailed 3D Texture Generation via Multi-View Diffusion with Geometric Control

    Authors: Mingqi Shao, Feng Xiong, Zhaoxu Sun, Mu Xu

    Abstract: Recently, significant advances have been made in 3D object generation. Building upon the generated geometry, current pipelines typically employ image diffusion models to generate multi-view RGB images, followed by UV texture reconstruction through texture baking. While 3D geometry generation has improved significantly, supported by multiple open-source frameworks, 3D texture generation remains und…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: Project page: https://amap-cvlab.github.io/MV-Painter