
Showing 1–50 of 747 results for author: Gao, S

Searching in archive cs.
  1. arXiv:2507.05991  [pdf, other]

    cs.CL

    Evolution without Large Models: Training Language Model with Task Principles

    Authors: Minghang Zhu, Shen Gao, Zhengliang Shi, Jiabao Fang, Pengjie Ren, Zhaochun Ren, Zhumin Chen, Shuo Shang

    Abstract: A common training approach for language models involves using a large-scale language model to expand a human-provided dataset, which is subsequently used for model training. This method significantly reduces training costs by eliminating the need for extensive human data annotation. However, it still faces challenges such as high carbon emissions during data augmentation and the risk of data leakag…

    Submitted 8 July, 2025; originally announced July 2025.

  2. arXiv:2507.05568  [pdf, ps, other]

    cs.CV cs.LG

    ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models

    Authors: Jiaxu Tian, Xuehui Yu, Yaoxing Wang, Pan Wang, Guangqian Guo, Shan Gao

    Abstract: Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements,…

    Submitted 7 July, 2025; originally announced July 2025.

  3. arXiv:2507.05197  [pdf, ps, other]

    cs.CL cs.LG

    Pre-Trained Policy Discriminators are General Reward Models

    Authors: Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen

    Abstract: We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model…

    Submitted 7 July, 2025; originally announced July 2025.

  4. arXiv:2507.02477  [pdf, ps, other]

    cs.CV cs.GR

    Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk

    Authors: Gaochao Song, Zibo Zhao, Haohan Weng, Jingbo Zeng, Rongfei Jia, Shenghua Gao

    Abstract: We introduce Mesh Silksong, a compact and efficient mesh representation tailored to generate the polygon mesh in an auto-regressive manner akin to silk weaving. Existing mesh tokenization methods always produce token sequences with repeated vertex tokens, wasting the network capability. Therefore, our approach tokenizes mesh vertices by accessing each mesh vertex only once, reduces the token sequ…

    Submitted 4 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

    Comments: 9 pages main text, 14 pages appendix, 23 figures

  5. arXiv:2507.01467  [pdf, ps, other]

    cs.CV

    Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

    Authors: Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li

    Abstract: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully…

    Submitted 2 July, 2025; originally announced July 2025.

  6. arXiv:2507.00839  [pdf, ps, other]

    cs.DB

    RapidStore: An Efficient Dynamic Graph Storage System for Concurrent Queries

    Authors: Chiyu Hao, Jixian Su, Shixuan Sun, Hao Zhang, Sen Gao, Jianwen Zhao, Chenyi Zhang, Jieru Zhao, Chen Chen, Minyi Guo

    Abstract: Dynamic graph storage systems are essential for real-time applications such as social networks and recommendation, where graph data continuously evolves. However, they face significant challenges in efficiently handling concurrent read and write operations. We find that existing methods suffer from write queries interfering with read efficiency, substantial time and space overhead due to per-edge…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 17 pages, 18 figures

  7. arXiv:2506.22726  [pdf, ps, other]

    cs.CV cs.LG

    XTransfer: Cross-Modality Model Transfer for Human Sensing with Few Data at the Edge

    Authors: Yu Zhang, Xi Zhang, Hualin Zhou, Xinyuan Chen, Shang Gao, Hong Jia, Jianfei Yang, Yuankai Qi, Tao Gu

    Abstract: Deep learning for human sensing on edge systems offers significant opportunities for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. Current methods that rely on transferring pre-trained models often encounter issues such as modality shift and high resource demands, resulting in substantial…

    Submitted 27 June, 2025; originally announced June 2025.

  8. arXiv:2506.19610  [pdf, ps, other]

    cs.CE

    V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis

    Authors: Yuan Wang, Jiaxiang Liu, Shujian Gao, Bin Feng, Zhihang Tang, Xiaotang Gai, Jian Wu, Zuozhu Liu

    Abstract: Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing disease-specific regions crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are crucial for clinical decis…

    Submitted 27 June, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

    Comments: 12 pages, 4 figures

  9. arXiv:2506.19288  [pdf, ps, other]

    cs.CV cs.RO

    Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding

    Authors: Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue, Hui Xiong

    Abstract: Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to…

    Submitted 30 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: 14 pages, 13 figures

  10. arXiv:2506.19263  [pdf, ps, other]

    cs.CV

    3D-SSM: A Novel 3D Selective Scan Module for Remote Sensing Change Detection

    Authors: Rui Huang, Jincheng Zeng, Sen Gao, Yan Xing

    Abstract: Existing Mamba-based approaches in remote sensing change detection have enhanced scanning models, yet remain limited by their inability to capture long-range dependencies between image channels effectively, which restricts their feature representation capabilities. To address this limitation, we propose a 3D selective scan module (3D-SSM) that captures global information from both the spatial plan…

    Submitted 23 June, 2025; originally announced June 2025.

  11. arXiv:2506.16662  [pdf, ps, other]

    cs.CC

    Bounded Distance Decoding for Random Lattices

    Authors: Shuhong Gao

    Abstract: The current paper investigates the bounded distance decoding (BDD) problem for ensembles of lattices whose generator matrices have sub-Gaussian entries. We first prove that, for these ensembles the BDD problem is NP-hard in the worst case. Then, we introduce a polynomial-time algorithm based on singular value decomposition (SVD) and establish, both theoretically and through extensive experiments,…

    Submitted 19 June, 2025; originally announced June 2025.

    MSC Class: 68Q25 (primary); 11H06 (secondary) ACM Class: F.2.2

  12. arXiv:2506.13915  [pdf, ps, other]

    cs.RO

    Sequence Modeling for Time-Optimal Quadrotor Trajectory Optimization with Sampling-based Robustness Analysis

    Authors: Katherine Mao, Hongzhan Yu, Ruipeng Zhang, Igor Spasojevic, M Ani Hsieh, Sicun Gao, Vijay Kumar

    Abstract: Time-optimal trajectories drive quadrotors to their dynamic limits, but computing such trajectories involves solving non-convex problems via iterative nonlinear optimization, making them prohibitively costly for real-time applications. In this work, we investigate learning-based models that imitate a model-based time-optimal trajectory planner to accelerate trajectory generation. Given a dataset o…

    Submitted 16 June, 2025; originally announced June 2025.

  13. arXiv:2506.13216  [pdf, ps, other]

    cs.CL

    Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law

    Authors: Qiming Ge, Shuhao Xing, Songyang Gao, Yunhua Zhou, Yicheng Zou, Songyang Zhang, Zhi Chen, Hang Yan, Qi Zhang, Qipeng Guo, Kai Chen

    Abstract: Scaling law builds the relationship between training computation and validation loss, enabling researchers to effectively predict the loss trend of models across different levels of computation. However, a gap still remains between validation loss and the model's downstream capabilities, making it nontrivial to apply scaling law to direct performance prediction for downstream tasks. The loss typ…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 9 pages, 9 figures, ACL2025

  14. arXiv:2506.12831  [pdf, ps, other]

    eess.SP cs.AI

    Synesthesia of Machines (SoM)-Enhanced Sub-THz ISAC Transmission for Air-Ground Network

    Authors: Zonghui Yang, Shijian Gao, Xiang Cheng, Liuqing Yang

    Abstract: Integrated sensing and communication (ISAC) within sub-THz frequencies is crucial for future air-ground networks, but unique propagation characteristics and hardware limitations present challenges in optimizing ISAC performance while increasing operational latency. This paper introduces a multi-modal sensing fusion framework inspired by synesthesia of machine (SoM) to enhance sub-THz ISAC transmis…

    Submitted 15 June, 2025; originally announced June 2025.

  15. arXiv:2506.10722  [pdf, ps, other]

    cs.CR cs.AI

    TED-LaST: Towards Robust Backdoor Defense Against Adaptive Attacks

    Authors: Xiaoxing Mo, Yuxuan Cheng, Nan Sun, Leo Yu Zhang, Wei Luo, Shang Gao

    Abstract: Deep Neural Networks (DNNs) are vulnerable to backdoor attacks, where attackers implant hidden triggers during training to maliciously control model behavior. Topological Evolution Dynamics (TED) has recently emerged as a powerful tool for detecting backdoor attacks in DNNs. However, TED can be vulnerable to backdoor attacks that adaptively distort topological representation distributions across n…

    Submitted 12 June, 2025; originally announced June 2025.

  16. arXiv:2506.09981  [pdf, ps, other]

    cs.CV cs.RO

    ReSim: Reliable World Simulation for Autonomous Driving

    Authors: Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, Li Chen

    Abstract: How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work,…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Project page: https://opendrivelab.com/ReSim

  17. arXiv:2506.09518  [pdf, other]

    cs.CV

    HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene

    Authors: Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Chengxuan Qian, Juyuan Kang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang

    Abstract: Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: r…

    Submitted 11 June, 2025; originally announced June 2025.

  18. arXiv:2506.09344  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.SD eess.AS

    Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, et al. (33 additional authors not shown)

    Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  19. arXiv:2506.08140  [pdf, ps, other]

    cs.LG cs.CL

    AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists

    Authors: Yifei Li, Hanane Nour Moussa, Ziru Chen, Shijie Chen, Botao Yu, Mingyi Xue, Benjamin Burns, Tzu-Yao Chiu, Vishal Dey, Zitong Lu, Chen Wei, Qianheng Zhang, Tianyu Zhang, Song Gao, Xuhui Huang, Xia Ning, Nesreen K. Ahmed, Ali Payani, Huan Sun

    Abstract: Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and param…

    Submitted 9 June, 2025; originally announced June 2025.

  20. arXiv:2506.07751  [pdf, ps, other]

    cs.CL cs.AI cs.SC

    AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking

    Authors: Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe

    Abstract: Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in their reasoning. I.e., they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" re…

    Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Under review

  21. arXiv:2506.07502  [pdf, other]

    cs.CL

    DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech

    Authors: Haotian Guo, Jing Han, Yongfeng Tu, Shihao Gao, Shengfan Shen, Wulong Xiang, Weihao Gan, Zixing Zhang

    Abstract: Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech cues and patterns-pronunciation, pause, stress and…

    Submitted 9 June, 2025; originally announced June 2025.

  22. arXiv:2506.05815  [pdf, ps, other]

    cs.CV

    NTIRE 2025 Challenge on HR Depth from Images of Specular and Transparent Surfaces

    Authors: Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, Alex Costanzino, Matteo Poggi, Samuele Salti, Stefano Mattoccia, Zhe Zhang, Yang Yang, Wu Chen, Anlong Ming, Mingshuai Zhao, Mengying Yu, Shida Gao, Xiangfeng Wang, Feng Xue, Jun Shi, Yong Yang, Yong A, Yixiang Jin, Dingzhe Li, Aryan Shukla, Liam Frija-Altarac, Matthew Toews, et al. (14 additional authors not shown)

    Abstract: This paper reports on the NTIRE 2025 challenge on HR Depth From images of Specular and Transparent surfaces, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2025. This challenge aims to advance the research on depth estimation, specifically to address two of the main open issues in the field: high-resolution and non-Lambertian surfaces. The cha…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: NTIRE Workshop Challenge Report, CVPR 2025

  23. arXiv:2506.05615  [pdf, ps, other]

    cs.LG cs.AI

    When Maximum Entropy Misleads Policy Optimization

    Authors: Ruipeng Zhang, Ya-Chien Chang, Sicun Gao

    Abstract: The Maximum Entropy Reinforcement Learning (MaxEnt RL) framework is a leading approach for achieving efficient learning and robust performance across many RL tasks. However, MaxEnt methods have also been shown to struggle with performance-critical control problems in practice, where non-MaxEnt algorithms can successfully learn. In this work, we analyze how the trade-off between robustness and opti…

    Submitted 5 June, 2025; originally announced June 2025.

    Journal ref: ICML 2025

  24. arXiv:2506.01375  [pdf, ps, other]

    cs.IR

    Generative Next POI Recommendation with Semantic ID

    Authors: Dongsheng Wang, Yuxi Huang, Shen Gao, Yifan Wang, Chengrui Huang, Shuo Shang

    Abstract: Point-of-interest (POI) recommendation systems aim to predict the next destinations of users based on their preferences and historical check-ins. Existing generative POI recommendation methods usually employ random numeric IDs for POIs, limiting the ability to model semantic relationships between similar locations. In this paper, we propose Generative Next POI Recommendation with Semantic ID (GNPR-…

    Submitted 18 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: 11 pages, 4 figures, the paper has been accepted by KDD 2025

  25. arXiv:2505.23363  [pdf, ps, other]

    cs.CL

    Discriminative Policy Optimization for Token-Level Reward Models

    Authors: Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao

    Abstract: Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: ICML 2025

  26. arXiv:2505.22400  [pdf, other]

    cs.GR cs.CV

    STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering

    Authors: Zehao Li, Hao Jiang, Yujun Cai, Jianing Chen, Baolong Bi, Shuqin Gao, Honglong Zhao, Yiwei Wang, Tianlu Mao, Zhaoqi Wang

    Abstract: Although dynamic scene reconstruction has long been a fundamental challenge in 3D vision, the recent emergence of 3D Gaussian Splatting (3DGS) offers a promising direction by enabling high-quality, real-time rendering through explicit Gaussian primitives. However, existing 3DGS-based methods for dynamic reconstruction often suffer from spatio-temporal incoherence during initialization, wh…

    Submitted 28 May, 2025; originally announced May 2025.

  27. arXiv:2505.21817  [pdf, ps, other]

    cs.CV

    ALTER: All-in-One Layer Pruning and Temporal Expert Routing for Efficient Diffusion Generation

    Authors: Xiaomeng Yang, Lei Lu, Qihui Fan, Changdi Yang, Juyi Lin, Yanzhi Wang, Xuan Zhang, Shangqian Gao

    Abstract: Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images. However, their iterative denoising process results in significant computational overhead during inference, limiting their practical deployment in resource-constrained environments. Existing acceleration methods often adopt uniform strategies that fail to capture the temporal variations during diffusion…

    Submitted 27 May, 2025; originally announced May 2025.

  28. arXiv:2505.21522  [pdf, ps, other]

    cs.CV cs.AI cs.LG eess.IV

    CIM-NET: A Video Denoising Deep Neural Network Model Optimized for Computing-in-Memory Architectures

    Authors: Shan Gao, Zhiqiang Wu, Yawen Niu, Xiaotao Li, Qingqing Xu

    Abstract: While deep neural network (DNN)-based video denoising has demonstrated significant performance, deploying state-of-the-art models on edge devices remains challenging due to stringent real-time and energy efficiency requirements. Computing-in-Memory (CIM) chips offer a promising solution by integrating computation within memory cells, enabling rapid matrix-vector multiplication (MVM). However, exis…

    Submitted 22 May, 2025; originally announced May 2025.

  29. arXiv:2505.21494  [pdf, ps, other]

    cs.CV

    Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

    Authors: Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, Yang Liu

    Abstract: Multimodal large language models (MLLMs) remain vulnerable to transferable adversarial examples. While existing methods typically achieve targeted attacks by aligning global features, such as CLIP's [CLS] token, between adversarial and target samples, they often overlook the rich local information encoded in patch tokens. This leads to suboptimal alignment and limited transferability, particularly f…

    Submitted 27 May, 2025; originally announced May 2025.

  30. arXiv:2505.21106  [pdf, other]

    cs.AI

    Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation

    Authors: Zhengyang Ji, Yifan Jia, Shang Gao, Yutao Yue

    Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet they also exhibit notable social biases. These biases often manifest as unintended associations between neutral concepts and sensitive human attributes, leading to disparate model behaviors across demographic groups. While existing studies primarily focus on detecting and quantifying such biases, they o…

    Submitted 27 May, 2025; originally announced May 2025.

  31. arXiv:2505.20903  [pdf, other]

    cs.CL

    Towards Objective Fine-tuning: How LLMs' Prior Knowledge Causes Potential Poor Calibration?

    Authors: Ziming Wang, Zeyu Shi, Haoyi Zhou, Shiqi Gao, Qingyun Sun, Jianxin Li

    Abstract: Fine-tuned Large Language Models (LLMs) often demonstrate poor calibration, with their confidence scores misaligned with actual performance. While calibration has been extensively studied in models trained from scratch, the impact of LLMs' prior knowledge on calibration during fine-tuning remains understudied. Our research reveals that LLMs' prior knowledge causes potential poor calibration due to…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL2025 Main; The code will be released soon

  32. arXiv:2505.20016  [pdf, ps, other]

    cs.CL

    TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation

    Authors: Chengrui Huang, Shen Gao, Zhengliang Shi, Dongsheng Wang, Shuo Shang

    Abstract: Existing tool-learning methods usually rely on supervised fine-tuning and often overlook fine-grained optimization of internal tool call details, leading to limitations in preference alignment and error discrimination. To overcome these challenges, we propose Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm for constructing token-level tool-use preference…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: 16 pages, 5 figures

  33. arXiv:2505.19484  [pdf, ps, other]

    cs.CL

    CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis

    Authors: Ruixiang Feng, Shen Gao, Xiuying Chen, Lisi Chen, Shuo Shang

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often exhibit specific cultural biases, neglecting the values and linguistic diversity of low-resource regions. This cultural bias not only undermines universal equality, but also risks reinforcing stereotypes and perpetuating discrimination. To address this, we propose CulFiT, a novel culturall…

    Submitted 27 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: accepted by ACL 2025

  34. arXiv:2505.19334  [pdf, other]

    cs.LG cs.IR

    Likert or Not: LLM Absolute Relevance Judgments on Fine-Grained Ordinal Scales

    Authors: Charles Godfrey, Ping Nie, Natalia Ostapuk, David Ken, Shang Gao, Souheil Inati

    Abstract: Large language models (LLMs) obtain state of the art zero shot relevance ranking performance on a variety of information retrieval tasks. The two most common prompts to elicit LLM relevance judgments are pointwise scoring (a.k.a. relevance generation), where the LLM sees a single query-document pair and outputs a single relevance score, and listwise ranking (a.k.a. permutation generation), where t…

    Submitted 25 May, 2025; originally announced May 2025.

    ACM Class: H.3.3; I.2.7; H.3.1

  35. arXiv:2505.19247  [pdf, ps, other]

    cs.LG cs.AI cs.RO

    Improving Value Estimation Critically Enhances Vanilla Policy Gradient

    Authors: Tao Wang, Ruipeng Zhang, Sicun Gao

    Abstract: Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply incr…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 15 pages and 21 figures

    ACM Class: I.2.6

  36. Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations

    Authors: Yuhan Ji, Song Gao, Ying Nie, Ivan Majić, Krzysztof Janowicz

    Abstract: Applying AI foundation models directly to geospatial datasets remains challenging due to their limited ability to represent and reason with geographical entities, specifically vector-based geometries and natural language descriptions of complex spatial relations. To address these issues, we investigate the extent to which a well-known-text (WKT) representation of geometries and their spatial relat…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: 33 pages, 13 figures, IJGIS GeoFM Special Issue

    ACM Class: I.2

    Journal ref: International Journal of Geographical Information Science, 2025

  37. arXiv:2505.17100  [pdf, ps, other]

    cs.CL

    Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector

    Authors: Haoyan Yang, Runxue Bao, Cao Xiao, Jun Ma, Parminder Bhatia, Shangqian Gao, Taha Kass-Hout

    Abstract: LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to al…

    Submitted 21 May, 2025; originally announced May 2025.

  38. arXiv:2505.15179  [pdf, ps, other]

    cs.SE

    RAG or Fine-tuning? A Comparative Study on LCMs-based Code Completion in Industry

    Authors: Chaozheng Wang, Zezhou Yang, Shuzheng Gao, Cuiyun Gao, Ting Peng, Hailiang Huang, Yuetang Deng, Michael Lyu

    Abstract: Code completion, a crucial practice in industrial settings, helps developers improve programming efficiency by automatically suggesting code snippets during development. With the emergence of Large Code Models (LCMs), this field has witnessed significant advancements. Due to the natural differences between open-source and industrial codebases, such as coding patterns and unique internal dependenci…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted in FSE 25 Industry Track

  39. arXiv:2505.14049  [pdf, ps, other]

    cs.CV

    Learning Concept-Driven Logical Rules for Interpretable and Generalizable Medical Image Classification

    Authors: Yibo Gao, Hangqi Zhou, Zheyao Gao, Bomin Wang, Shangqi Gao, Sihan Wang, Xiahai Zhuang

    Abstract: The pursuit of decision safety in clinical applications highlights the potential of concept-based methods in medical imaging. While these models offer active interpretability, they often suffer from concept leakages, where unintended information within soft concept representations undermines both interpretability and generalizability. Moreover, most concept-based models focus solely on local expla…

    Submitted 9 June, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: early accepted by MICCAI 2025

  40. arXiv:2505.13740  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    Improving Compositional Generation with Diffusion Models Using Lift Scores

    Authors: Chenning Yu, Sicun Gao

    Abstract: We introduce a novel resampling criterion using lift scores, for improving compositional generation in diffusion models. By leveraging the lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original d…

    Submitted 25 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Journal ref: ICML 2025

  41. arXiv:2505.13731  [pdf, ps, other]

    cs.CV

    GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

    Authors: Pengyue Jia, Seongheon Park, Song Gao, Xiangyu Zhao, Yixuan Li

    Abstract: Worldwide image geolocalization, the task of predicting GPS coordinates from images taken anywhere on Earth, poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to mo…

    Submitted 19 May, 2025; originally announced May 2025.

  42. arXiv:2505.13440  [pdf, ps, other]

    cs.CV

    Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos

    Authors: Ruoyu Wang, Yi Ma, Shenghua Gao

    Abstract: Currently almost all state-of-the-art novel view synthesis and reconstruction models rely on calibrated cameras or additional geometric priors for training. These prerequisites significantly limit their applicability to massive uncalibrated data. To alleviate this requirement and unlock the potential for self-supervised training on large-scale uncalibrated videos, we propose a novel two-stage stra…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: 13 pages, 4 figures

  43. arXiv:2505.12759  [pdf, ps, other]

    cs.LG

    Your Offline Policy is Not Trustworthy: Bilevel Reinforcement Learning for Sequential Portfolio Optimization

    Authors: Haochen Yuan, Minting Pan, Yunbo Wang, Siyu Gao, Philip S. Yu, Xiaokang Yang

    Abstract: Reinforcement learning (RL) has shown significant promise for sequential portfolio optimization tasks, such as stock trading, where the objective is to maximize cumulative returns while minimizing risks using historical data. However, traditional RL approaches often produce policies that merely memorize the optimal yet impractical buying and selling behaviors within the fixed dataset. These offlin…

    Submitted 19 May, 2025; originally announced May 2025.

  44. arXiv:2505.12657  [pdf, ps, other]

    eess.SY cs.SI math.OC

    Transmission Neural Networks: Approximation and Optimal Control

    Authors: Shuang Gao, Peter E. Caines

    Abstract: Transmission Neural Networks (TransNNs) introduced by Gao and Caines (2022) connect virus spread models over networks and neural networks with tuneable activation functions. This paper presents the approximation technique and the underlying assumptions employed by TransNNs in relation to the corresponding Markovian Susceptible-Infected-Susceptible (SIS) model with 2^n states, where n is the number…

    Submitted 18 May, 2025; originally announced May 2025.

    Journal ref: IFAC Conference on Networked Systems, 2025

  45. arXiv:2505.11908  [pdf, other]

    cs.CL

    ELITE: Embedding-Less retrieval with Iterative Text Exploration

    Authors: Zhangyu Wang, Siyuan Gao, Rong Zhou, Hao Wang, Li Ning

    Abstract: Large Language Models (LLMs) have achieved impressive progress in natural language processing, but their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks. Retrieval-Augmented Generation (RAG) mitigates this by retrieving relevant information from an external corpus. However, existing RAG systems often rely on embedding-based retrieval trained…

    Submitted 17 May, 2025; originally announced May 2025.

  46. arXiv:2505.06111  [pdf, ps, other]

    cs.RO cs.AI cs.LG

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Authors: Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li

    Abstract: A generalist robot should perform effectively across various environments. However, most existing approaches rely heavily on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a…

    Submitted 15 May, 2025; v1 submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted to RSS 2025. Code is available at https://github.com/OpenDriveLab/UniVLA

  47. arXiv:2505.05464  [pdf, other]

    cs.CL

    Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

    Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He

    Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging that connects parameters of different models. Unlike previous works…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: ICML 2025. Our code is publicly available at https://github.com/shiqichen17/VLM_Merging

  48. arXiv:2505.02329  [pdf, ps, other]

    cs.CY cs.HC cs.SE

    Regulating Algorithmic Management: A Multi-Stakeholder Study of Challenges in Aligning Software and the Law for Workplace Scheduling

    Authors: Jonathan Lynn, Rachel Y. Kim, Sicun Gao, Daniel Schneider, Sachin S. Pandya, Min Kyung Lee

    Abstract: Algorithmic management (AM)'s impact on worker well-being has led to calls for regulation. However, little is known about the effectiveness and challenges in real-world AM regulation across the regulatory process -- rule operationalization, software use, and enforcement. Our multi-stakeholder study addresses this gap within workplace scheduling, one of the few AM domains with implemented regulatio…

    Submitted 1 July, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

    Comments: FAccT'25

  49. arXiv:2505.00592  [pdf, other]

    cs.CV cs.LG

    Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

    Authors: Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu

    Abstract: Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address the problem, we propose a novel Uncertainty-aware Multi-exp…

    Submitted 1 May, 2025; originally announced May 2025.

  50. arXiv:2504.21209  [pdf]

    eess.SP cs.LG

    Generalised Label-free Artefact Cleaning for Real-time Medical Pulsatile Time Series

    Authors: Xuhang Chen, Ihsane Olakorede, Stefan Yu Bögli, Wenhao Xu, Erta Beqiri, Xuemeng Li, Chenyu Tang, Zeyu Gao, Shuo Gao, Ari Ercole, Peter Smielewski

    Abstract: Artefacts compromise clinical decision-making in the use of medical time series. Pulsatile waveforms offer probabilities for accurate artefact detection, yet most approaches rely on supervised manners and overlook patient-level distribution shifts. To address these issues, we introduce a generalised label-free framework, GenClean, for real-time artefact cleaning and leverage an in-house dataset of…

    Submitted 29 April, 2025; originally announced April 2025.