Showing 1–50 of 624 results for author: Zheng, W

Searching in archive cs.
  1. arXiv:2507.05275  [pdf, ps, other]

    cs.CY cs.AI cs.HC cs.LO cs.MA

    A Fuzzy Supervisor Agent Design for Clinical Reasoning Assistance in a Multi-Agent Educational Clinical Scenario Simulation

    Authors: Weibing Zheng, Laurah Turner, Jess Kropczynski, Murat Ozer, Seth Overla, Shane Halse

    Abstract: Assisting medical students with clinical reasoning (CR) during clinical scenario training remains a persistent challenge in medical education. This paper presents the design and architecture of the Fuzzy Supervisor Agent (FSA), a novel component for the Multi-Agent Educational Clinical Scenario Simulation (MAECSS) platform. The FSA leverages a Fuzzy Inference System (FIS) to continuously interpret…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 6 pages, 3 figures, 1 table. 2025 IFSA World Congress NAFIPS Annual Meeting

    ACM Class: D.2.4; K.3.1; C.3; I.2.6

  2. arXiv:2507.04692  [pdf, ps, other]

    cs.CV

    Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

    Authors: Wanchang Yu, Qing Zhang, Rongjia Zheng, Wei-Shi Zheng

    Abstract: We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows to generate a shadow-independent structure…

    Submitted 7 July, 2025; originally announced July 2025.

  3. arXiv:2507.04511  [pdf, ps, other]

    cs.CV

    FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

    Authors: Xinhua Lu, Runhe Lai, Yanqi Wu, Kanghao Chen, Wei-Shi Zheng, Ruixuan Wang

    Abstract: Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative C…

    Submitted 8 July, 2025; v1 submitted 6 July, 2025; originally announced July 2025.

    Comments: 12 pages, 4 figures, Accepted by ICCV2025

  4. arXiv:2507.04243  [pdf, ps, other]

    cs.CV cs.AI

    Domain Generalizable Portrait Style Transfer

    Authors: Xinbo Wang, Wenju Xu, Qing Zhang, Wei-Shi Zheng

    Abstract: This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter…

    Submitted 7 July, 2025; v1 submitted 6 July, 2025; originally announced July 2025.

    Comments: Accepted to ICCV2025

  5. arXiv:2507.04060  [pdf, ps, other]

    cs.CV cs.AI

    Temporal Continual Learning with Prior Compensation for Human Motion Prediction

    Authors: Jianwei Tang, Jiangxin Sun, Xiaotong Lin, Lifang Zhang, Wei-Shi Zheng, Jian-Fang Hu

    Abstract: Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent pr…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Advances in Neural Information Processing Systems 2023

    Journal ref: Advances in Neural Information Processing Systems, 2023, 36: 65837-65849

  6. arXiv:2507.03924  [pdf, ps, other]

    cs.CV

    DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering

    Authors: Rongjia Zheng, Qing Zhang, Chengjiang Long, Wei-Shi Zheng

    Abstract: Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic predictio…

    Submitted 5 July, 2025; originally announced July 2025.

  7. arXiv:2507.02863  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

    Authors: Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu

    Abstract: Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Code is available at: https://github.com/YkiWu/Point3R

  8. arXiv:2507.01857  [pdf, ps, other]

    cs.RO

    TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types

    Authors: Yuhao Lin, Yi-Lin Wei, Haoran Liao, Mu Lin, Chengyi Xing, Hao Li, Dandan Zhang, Mark Cutkosky, Wei-Shi Zheng

    Abstract: Dexterous teleoperation plays a crucial role in robotic manipulation for real-world data collection and remote robot control. Previous dexterous teleoperation mostly relies on hand retargeting to closely mimic human hand postures. However, these approaches may fail to fully leverage the inherent dexterity of dexterous hands, which can execute unique actions through their structural advantages comp…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Project Page: https://isee-laboratory.github.io/TypeTele

  9. arXiv:2506.23074  [pdf, ps, other]

    cs.CV cs.CR cs.LG

    Learning Counterfactually Decoupled Attention for Open-World Model Attribution

    Authors: Yu Zheng, Boyang Gong, Fanye Kong, Yueqi Duan, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jiwen Lu, Jie Zhou

    Abstract: In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships betw…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025. Code: https://github.com/yzheng97/CDAL

  10. arXiv:2506.21714  [pdf, ps, other]

    cs.LG cs.CV

    ODE$_t$(ODE$_l$): Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling

    Authors: Denis Gudovskiy, Wenzhao Zheng, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer

    Abstract: Recently, continuous normalizing flows (CNFs) and diffusion models (DMs) have been studied using the unified theoretical framework. Although such models can generate high-quality data points from a noise distribution, the sampling demands multiple iterations to solve an ordinary differential equation (ODE) with high computational complexity. Most existing methods focus on reducing the number of ti…

    Submitted 2 July, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

    Comments: Preprint. Github page: github.com/gudovskiy/odelt

  11. arXiv:2506.17682  [pdf, ps, other]

    cs.IR cs.AI

    Reinforcing User Interest Evolution in Multi-Scenario Learning for recommender systems

    Authors: Zhijian Feng, Wenhao Zheng, Xuanji Xiao

    Abstract: In real-world recommendation systems, users would engage in variety scenarios, such as homepages, search pages, and related recommendation pages. Each of these scenarios would reflect different aspects users focus on. However, the user interests may be inconsistent in different scenarios, due to differences in decision-making processes and preference expression. This variability complicates unifie…

    Submitted 21 June, 2025; originally announced June 2025.

    MSC Class: 68T07

    ACM Class: H.3.3

  12. arXiv:2506.13320  [pdf, ps, other]

    cs.CV cs.LG

    Action Dubber: Timing Audible Actions via Inflectional Flow

    Authors: Wenlong Wan, Weiying Zheng, Tianyi Xiang, Guiqing Li, Shengfeng He

    Abstract: We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflec…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted by ICML2025

  13. arXiv:2506.12119  [pdf, ps, other]

    cs.CL cs.AI

    Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?

    Authors: Houyi Li, Ka Man Lo, Ziqi Wang, Zili Wang, Wenzhen Zheng, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

    Abstract: Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints - that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical…

    Submitted 13 June, 2025; originally announced June 2025.

  14. arXiv:2506.11221  [pdf, ps, other]

    cs.AI cs.CL cs.LO

    LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic

    Authors: Weibing Zheng, Laurah Turner, Jess Kropczynski, Murat Ozer, Tri Nguyen, Shane Halse

    Abstract: Clinical communication skills are critical in medical education, and practicing and assessing clinical communication skills on a scale is challenging. Although LLM-powered clinical scenario simulations have shown promise in enhancing medical students' clinical practice, providing automated and scalable clinical evaluation that follows nuanced physician judgment is difficult. This paper combines fu…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: 12 pages, 1 figure, 2025 IFSA World Congress NAFIPS Annual Meeting

    ACM Class: D.2.4; K.3.1; C.3; I.2.6

  15. arXiv:2506.10981  [pdf, ps, other]

    cs.CV

    SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis

    Authors: Weiliang Chen, Jiayi Bi, Yuanhui Huang, Wenzhao Zheng, Yueqi Duan

    Abstract: Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry…

    Submitted 12 June, 2025; originally announced June 2025.

  16. arXiv:2506.10977  [pdf, ps, other]

    cs.CV

    QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction

    Authors: Sicheng Zuo, Wenzhao Zheng, Xiaoyong Han, Longchao Yang, Yong Pan, Jiwen Lu

    Abstract: 3D occupancy prediction is crucial for robust autonomous driving systems as it enables comprehensive perception of environmental structures and semantics. Most existing methods employ dense voxel-based scene representations, ignoring the sparsity of driving scenes and resulting in inefficiency. Recent works explore object-centric representations based on sparse Gaussians, but their ellipsoidal sha…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Project page: https://zuosc19.github.io/QuadricFormer/

  17. arXiv:2506.10975  [pdf, ps, other]

    cs.CV

    GenWorld: Towards Detecting AI-generated Real-world Simulation Videos

    Authors: Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, Yueqi Duan

    Abstract: The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated…

    Submitted 12 June, 2025; originally announced June 2025.

  18. arXiv:2506.10972  [pdf, ps, other]

    cs.LG cs.AI

    Farseer: A Refined Scaling Law in Large Language Models

    Authors: Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

    Abstract: Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing…

    Submitted 14 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

    Comments: 34

    ACM Class: I.2

  19. arXiv:2506.10962  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    SpectralAR: Spectral Autoregressive Visual Generation

    Authors: Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, Jiwen Lu

    Abstract: Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a S…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Project Page: https://huang-yh.github.io/spectralar/

  20. arXiv:2506.09638  [pdf, ps, other]

    cs.LG cs.CV

    FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models

    Authors: Weiying Zheng, Ziyue Lin, Pengxin Guo, Yuyin Zhou, Feifei Wang, Liangqiong Qu

    Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in cross-modal understanding and generation by integrating visual and textual information. While instruction tuning and parameter-efficient fine-tuning methods have substantially improved the generalization of VLMs, most existing approaches rely on centralized training, posing challenges for deployment in domains with strict p…

    Submitted 11 June, 2025; originally announced June 2025.

  21. arXiv:2506.09496  [pdf, other]

    cs.LG cs.AI

    EnerBridge-DPO: Energy-Guided Protein Inverse Folding with Markov Bridges and Direct Preference Optimization

    Authors: Dingyi Rong, Haotian Lu, Wenzhuo Zheng, Fan Zhang, Shuangjia Zheng, Ning Liu

    Abstract: Designing protein sequences with optimal energetic stability is a key challenge in protein inverse folding, as current deep learning methods are primarily trained by maximizing sequence recovery rates, often neglecting the energy of the generated sequences. This work aims to overcome this limitation by developing a model that directly generates low-energy, stable protein sequences. We propose Ener…

    Submitted 11 June, 2025; originally announced June 2025.

  22. arXiv:2506.07826  [pdf, other]

    cs.CV cs.LG cs.RO

    R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

    Authors: William Ljungbergh, Bernardo Taveira, Wenzhao Zheng, Adam Tonderski, Chensheng Peng, Fredrik Kahl, Christoffer Petersson, Michael Felsberg, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan

    Abstract: Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creatin…

    Submitted 9 June, 2025; originally announced June 2025.

  23. arXiv:2506.06982  [pdf, ps, other]

    cs.CL

    Chain of Methodologies: Scaling Test Time Computation without Training

    Authors: Cong Liu, Jie Wu, Weigang Wu, Xu Chen, Liang Lin, Wei-Shi Zheng

    Abstract: Large Language Models (LLMs) often struggle with complex reasoning tasks due to insufficient in-depth insights in their training data, which are typically absent in publicly available documents. This paper introduces the Chain of Methodologies (CoM), an innovative and intuitive prompting framework that enhances structured thinking by integrating human methodological insights, enabling LLMs to tack…

    Submitted 7 June, 2025; originally announced June 2025.

    Journal ref: ACL 2025

  24. arXiv:2506.05473  [pdf, ps, other]

    cs.CV

    S2GO: Streaming Sparse Gaussian Occupancy Prediction

    Authors: Jinhyung Park, Yihan Hu, Chensheng Peng, Wenzhao Zheng, Kris Kitani, Wei Zhan

    Abstract: Despite the demonstrated efficiency and performance of sparse query-based representations for perception, state-of-the-art 3D occupancy prediction methods still rely on voxel-based or dense Gaussian-based 3D representations. However, dense representations are slow, and they lack flexibility in capturing the temporal dynamics of driving scenes. Distinct from prior work, we instead summarize the sce…

    Submitted 5 June, 2025; originally announced June 2025.

  25. arXiv:2506.05204  [pdf, other]

    cs.CV

    OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

    Authors: Yanbo Wang, Ziyi Wang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

    Abstract: Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In…

    Submitted 5 June, 2025; originally announced June 2025.

  26. arXiv:2506.04225  [pdf, ps, other]

    cs.CV

    Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation

    Authors: Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson W. H. Lau, Wangmeng Zuo, Chunchao Guo

    Abstract: Real-world applications like video gaming and virtual reality often demand the ability to model 3D scenes that users can explore along custom camera trajectories. While significant progress has been made in generating 3D objects from text or images, creating long-range, 3D-consistent, explorable 3D scenes remains a complex and challenging problem. In this work, we present Voyager, a novel video di…

    Submitted 4 June, 2025; originally announced June 2025.

  27. arXiv:2506.03464  [pdf, ps, other]

    cs.GT cs.LG math.OC

    From Average-Iterate to Last-Iterate Convergence in Games: A Reduction and Its Applications

    Authors: Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng

    Abstract: The convergence of online learning algorithms in games under self-play is a fundamental question in game theory and machine learning. Among various notions of convergence, last-iterate convergence is particularly desirable, as it reflects the actual decisions made by the learners and captures the day-to-day behavior of the learning dynamics. While many algorithms are known to converge in the avera…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 21 pages

  28. arXiv:2505.24718  [pdf, ps, other]

    cs.CV

    Reinforcing Video Reasoning with Focused Thinking

    Authors: Jisheng Dang, Jingze Wu, Teng Wang, Xuanhui Lin, Nannan Zhu, Hongbo Chen, Wei-Shi Zheng, Meng Wang, Tat-Seng Chua

    Abstract: Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially co…

    Submitted 8 June, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

  29. arXiv:2505.22421  [pdf, ps, other]

    cs.CV cs.RO

    GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control

    Authors: Anthony Chen, Wenzhao Zheng, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Kurt Keutzer, Shanghang Zhang

    Abstract: Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Cur…

    Submitted 29 May, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

    Comments: code will be released at https://github.com/antonioo-c/GeoDrive

  30. arXiv:2505.22335  [pdf, other]

    cs.RO cs.CV

    UP-SLAM: Adaptively Structured Gaussian SLAM with Uncertainty Prediction in Dynamic Environments

    Authors: Wancai Zheng, Linlin Ou, Jiajie He, Libo Zhou, Xinyi Yu, Yan Wei

    Abstract: Recent 3D Gaussian Splatting (3DGS) techniques for Visual Simultaneous Localization and Mapping (SLAM) have significantly progressed in tracking and high-fidelity mapping. However, their sequential optimization framework and sensitivity to dynamic objects limit real-time performance and robustness in real-world scenarios. We present UP-SLAM, a real-time RGB-D SLAM system for dynamic environments t…

    Submitted 28 May, 2025; originally announced May 2025.

  31. arXiv:2505.20678  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts

    Authors: Tianhua Qi, Shiyan Wang, Cheng Lu, Tengfei Song, Hao Yang, Zhanglin Wu, Wenming Zheng

    Abstract: Controllable emotional voice conversion (EVC) aims to manipulate emotional expressions to increase the diversity of synthesized speech. Existing methods typically rely on predefined labels, reference audios, or prespecified factor values, often overlooking individual differences in emotion perception and expression. In this paper, we introduce PromptEVC that utilizes natural language prompts for p…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted to INTERSPEECH2025

  32. arXiv:2505.20610  [pdf, ps, other]

    cs.CV cs.RO

    OmniIndoor3D: Comprehensive Indoor 3D Reconstruction

    Authors: Xiaobao Wei, Xiaoan Zhang, Hao Wang, Qingpo Wuwu, Ming Lu, Wenzhao Zheng, Shanghang Zhang

    Abstract: We propose a novel framework for comprehensive indoor 3D reconstruction using Gaussian representations, called OmniIndoor3D. This framework enables accurate appearance, geometry, and panoptic reconstruction of diverse indoor scenes captured by a consumer-level RGB-D camera. Since 3DGS is primarily optimized for photorealistic rendering, it lacks the precise geometry critical for high-quality panop…

    Submitted 26 May, 2025; originally announced May 2025.

  33. arXiv:2505.19502  [pdf, ps, other]

    cs.SE cs.AI

    CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation

    Authors: Guang Yang, Yu Zhou, Xiang Chen, Wei Zheng, Xing Hu, Xin Zhou, David Lo, Taolue Chen

    Abstract: Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitation in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and th…

    Submitted 26 May, 2025; originally announced May 2025.

  34. arXiv:2505.18596  [pdf, ps, other]

    cs.CL cs.AI

    Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

    Authors: Chen Han, Wenzhen Zheng, Xijin Tang

    Abstract: The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical…

    Submitted 27 May, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

  35. arXiv:2505.18168  [pdf, ps, other]

    cs.LG cs.GR

    Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation

    Authors: Feifan Wang, Tengfei Song, Minggui He, Chang Su, Zhanglin Wu, Hao Yang, Wenming Zheng, Osamu Yoshie

    Abstract: Facial emotion perception in the vision large language model (VLLM) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-v…

    Submitted 13 May, 2025; originally announced May 2025.

  36. arXiv:2505.15725  [pdf, ps, other]

    cs.RO cs.CV

    UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

    Authors: Xiangyu Wang, Donglin Yang, Yue Liao, Wenhao Zheng, wenjun wu, Bin Dai, Hongsheng Li, Si Liu

    Abstract: Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions.…

    Submitted 26 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  37. arXiv:2505.15576  [pdf, other]

    cs.CV cs.LG

    Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models

    Authors: Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang

    Abstract: Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insuf…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI 2025)

  38. arXiv:2505.12702  [pdf, ps, other]

    cs.CV

    Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

    Authors: Tianming Liang, Haichao Jiang, Yuting Yang, Chaolei Tan, Shuai Li, Wei-Shi Zheng, Jian-Fang Hu

    Abstract: Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focus on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Project Page: https://isee-laboratory.github.io/Long-RVOS

  39. arXiv:2505.10900  [pdf, other]

    cs.IR cs.AI

    Explain What You Mean: Intent Augmented Knowledge Graph Recommender Built With An LLM

    Authors: Wenqing Zheng, Noah Fatsi, Daniel Barcklow, Dmitri Kalaev, Steven Yao, Owen Reinert, C. Bayan Bruss, Daniele Rosa

    Abstract: Interaction sparsity is a long-standing challenge in recommendation systems. Sparsity manifests in environments with disproportional cardinality of groupings of entities, such as users and products in an online marketplace. It is also found for newly introduced entities, described as the cold-start problem. Recent efforts to mitigate this issue either enrich the connectivity data by incorporating…

    Submitted 21 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  40. arXiv:2505.09010  [pdf, ps, other]

    cs.DS

    Fully Dynamic Euclidean Bi-Chromatic Matching in Sublinear Update Time

    Authors: Gramoz Goranci, Peter Kiss, Neel Patel, Martin P. Seybold, Eva Szilagyi, Da Wei Zheng

    Abstract: We consider the Euclidean bi-chromatic matching problem in the dynamic setting, where the goal is to efficiently process point insertions and deletions while maintaining a high-quality solution. Computing the minimum cost bi-chromatic matching is one of the core problems in geometric optimization that has found many applications, most notably in estimating Wasserstein distance between two distribu…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: accepted at ICML 2025

  41. VisTaxa: Developing a Taxonomy of Historical Visualizations

    Authors: Yu Zhang, Xinyue Chen, Weili Zheng, Yuhan Guo, Guozheng Li, Siming Chen, Xiaoru Yuan

    Abstract: Historical visualizations are a rich resource for visualization research. While taxonomy is commonly used to structure and understand the design space of visualizations, existing taxonomies primarily focus on contemporary visualizations and largely overlook historical visualizations. To address this gap, we describe an empirical method for taxonomy development. We introduce a coding protocol and t…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: Accepted to IEEE TVCG (IEEE PacificVis 2025 Journal Track)

  42. arXiv:2505.01647  [pdf, ps, other]

    cs.NE cs.AI

    Scalable Speed-ups for the SMS-EMOA from a Simple Aging Strategy

    Authors: Mingfeng Li, Weijie Zheng, Benjamin Doerr

    Abstract: Different from single-objective evolutionary algorithms, where non-elitism is an established concept, multi-objective evolutionary algorithms almost always select the next population in a greedy fashion. In the only notable exception, Bian, Zhou, Li, and Qian (IJCAI 2023) proposed a stochastic selection mechanism for the SMS-EMOA and proved that it can speed up computing the Pareto front of the bi…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: Initial version of one paper accepted by IJCAI2025

  43. arXiv:2505.00010  [pdf]

    cs.CL cs.AI

    Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

    Authors: Tri Nguyen, Lohith Srikanth Pentapalli, Magnus Sieverding, Laurah Turner, Seth Overla, Weibing Zheng, Chris Zhou, David Furniss, Danielle Weber, Michael Gharib, Matt Kelleher, Michael Shukis, Cameron Pawlik, Kelly Cohen

    Abstract: Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate stron…

    Submitted 21 April, 2025; originally announced May 2025.

  44. arXiv:2504.21552  [pdf, other]

    cs.NE

    The First Theoretical Approximation Guarantees for the Non-Dominated Sorting Genetic Algorithm III (NSGA-III)

    Authors: Renzhong Deng, Weijie Zheng, Benjamin Doerr

    Abstract: This work conducts a first theoretical analysis studying how well the NSGA-III approximates the Pareto front when the population size $N$ is less than the Pareto front size. We show that when $N$ is at least the number $N_r$ of reference points, then the approximation quality, measured by the maximum empty interval (MEI) indicator, on the OneMinMax benchmark is such that there is no empty interval…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: Initial version accepted by IJCAI 2025

  45. arXiv:2504.19276  [pdf, other]

    cs.LG cs.AI cs.CL

    Anyprefer: An Agentic Framework for Preference Data Synthesis

    Authors: Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Weitong Zhang, Ying Wei, Mohit Bansal, Huaxiu Yao

    Abstract: High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods often adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies since the reward model shares weights with t…

    Submitted 27 April, 2025; originally announced April 2025.

  46. arXiv:2504.18406  [pdf, other]

    cs.CL

    HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

    Authors: Yusen Zhang, Wenliang Zheng, Aashrith Madasu, Peng Shi, Ryo Kamoi, Hao Zhou, Zhuoyang Zou, Shu Zhao, Sarkar Snigdha Sarathi Das, Vipul Gupta, Xiaoxin Lu, Nan Zhang, Ranran Haoran Zhang, Avitej Iyer, Renze Lou, Wenpeng Yin, Rui Zhang

    Abstract: High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) can allegedly handle HRIs, however, there is a lack of a comprehensive benchmark for VLMs to evaluate HRI understanding. To address this gap, we introduce HRScene, a…

    Submitted 29 April, 2025; v1 submitted 25 April, 2025; originally announced April 2025.

    Comments: 22 pages, 8 figures

  47. arXiv:2504.18152  [pdf, other]

    cs.CV

    ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding

    Authors: Yi-Xing Peng, Qize Yang, Yu-Ming Tang, Shenghao Fu, Kun-Yu Lin, Xihan Wei, Wei-Shi Zheng

    Abstract: Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios, each…

    Submitted 25 April, 2025; originally announced April 2025.

  48. arXiv:2504.17349  [pdf, other]

    cs.CV cs.IR

    DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition

    Authors: Yiyan Xu, Wuqiang Zheng, Wenjie Wang, Fengbin Zhu, Xinting Hu, Yang Zhang, Fuli Feng, Tat-Seng Chua

    Abstract: Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing metho…

    Submitted 24 April, 2025; originally announced April 2025.

  49. arXiv:2504.15989  [pdf, ps, other]

    cs.SE

    Optimizing Token Consumption in LLMs: A Nano Surge Approach for Code Reasoning Efficiency

    Authors: Junwei Hu, Weicheng Zheng, Yihan Liu, Yan Liu

    Abstract: With the increasing adoption of large language models (LLMs) in software engineering, the Chain of Thought (CoT) reasoning paradigm has become an essential approach for automated code repair. However, the explicit multi-step reasoning in CoT leads to substantial increases in token consumption, reducing inference efficiency and raising computational costs, especially for complex code repair tasks.…

    Submitted 29 May, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  50. arXiv:2504.12805  [pdf, other]

    cs.CL cs.CY cs.HC

    Assessing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation

    Authors: Takaya Arita, Wenxian Zheng, Reiji Suzuki, Fuminori Akiba

    Abstract: This study explored how large language models (LLMs) perform in two areas related to art: writing critiques of artworks and reasoning about mental states (Theory of Mind, or ToM) in art-related situations. For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories. The model was prompted to first write a fu…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: 30 pages, 13 figures, 1 table