Showing 1–50 of 930 results for author: Gu, J

  1. arXiv:2509.26506  [pdf, ps, other]

    cs.AI

    SCUBA: Salesforce Computer Use Benchmark

    Authors: Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu

    Abstract: We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas, platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including Enterp…

    Submitted 30 September, 2025; originally announced September 2025.

  2. arXiv:2509.25534  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning

    Authors: Zhiling Ye, Yun Yue, Haowen Wang, Xudong Han, Jiadi Jiang, Cheng Wei, Lei Fan, Jiaxin Liang, Shuowen Zhang, Ji Li, Chunxiao Guo, Jian Wang, Peng Wei, Jinjie Gu

    Abstract: Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Lear…

    Submitted 19 September, 2025; originally announced September 2025.

  3. arXiv:2509.25143  [pdf, ps, other]

    cs.CV cs.CL

    TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

    Authors: Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng

    Abstract: Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we intro…

    Submitted 29 September, 2025; originally announced September 2025.

  4. arXiv:2509.24741  [pdf, ps, other]

    cs.CV

    Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm

    Authors: Xue-Feng Zhu, Tianyang Xu, Yifan Pan, Jinjie Gu, Xi Li, Jiwen Lu, Xiao-Jun Wu, Josef Kittler

    Abstract: Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhanc…

    Submitted 29 September, 2025; originally announced September 2025.

  5. arXiv:2509.24386  [pdf, ps, other]

    cs.CV

    PCICF: A Pedestrian Crossing Identification and Classification Framework

    Authors: Junyi Gu, Beatriz Cabrero-Daniel, Ali Nouri, Lydia Armini, Christian Berger

    Abstract: We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestri…

    Submitted 29 September, 2025; originally announced September 2025.

  6. arXiv:2509.23764  [pdf, ps, other]

    physics.optics cs.ET

    Photonics-Aware Planning-Guided Automated Electrical Routing for Large-Scale Active Photonic Integrated Circuits

    Authors: Hongjian Zhou, Haoyu Yang, Nicholas Gangi, Bowen Liu, Meng Zhang, Haoxing Ren, Xu Wang, Rena Huang, Jiaqi Gu

    Abstract: The rising demand for AI training and inference, as well as scientific computing, combined with stringent latency and energy budgets, is driving the adoption of integrated photonics for computing, sensing, and communications. As active photonic integrated circuits (PICs) scale in device count and functional heterogeneity, physical implementation by manual scripting and ad-hoc edits is no longer te…

    Submitted 28 September, 2025; originally announced September 2025.

    Comments: 9 pages

  7. arXiv:2509.21072  [pdf, ps, other]

    cs.AI

    Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

    Authors: Kaiwen He, Zhiwei Wang, Chenyi Zhuang, Jinjie Gu

    Abstract: In recent years, multimodal models have made remarkable strides and paved the way for intelligent browser-use agents. However, when solving tasks on real-world webpages in multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in…

    Submitted 25 September, 2025; originally announced September 2025.

  8. arXiv:2509.20358  [pdf, ps, other]

    cs.CV

    PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

    Authors: Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, Lingjie Liu

    Abstract: Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of phys…

    Submitted 24 September, 2025; originally announced September 2025.

    Comments: Accepted by NeurIPS 2025. This is the preview version; the camera-ready version is still in preparation

  9. arXiv:2509.19244  [pdf, ps, other]

    cs.CV

    Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

    Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen

    Abstract: We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text…

    Submitted 24 September, 2025; v1 submitted 23 September, 2025; originally announced September 2025.

    Comments: 31 pages, 15 figures

  10. arXiv:2509.17677  [pdf, ps, other]

    cs.AI

    EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

    Authors: Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao

    Abstract: Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation -- they need to deal with uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to ev…

    Submitted 22 September, 2025; originally announced September 2025.

  11. arXiv:2509.16494  [pdf, ps, other]

    cs.CL cs.AI

    Can an Individual Manipulate the Collective Decisions of Multi-Agents?

    Authors: Fengyuan Liu, Rui Zhao, Shuo Chen, Guohao Li, Philip Torr, Lei Han, Jindong Gu

    Abstract: Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system,…

    Submitted 19 September, 2025; originally announced September 2025.

  12. arXiv:2509.14558  [pdf, ps, other]

    cs.CR cs.AI cs.CL

    LLM Jailbreak Detection for (Almost) Free!

    Authors: Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu

    Abstract: Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks capable of producing inappropriate content. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. However, existing methods entail significant computational costs. In this paper, we firs…

    Submitted 17 September, 2025; originally announced September 2025.

  13. arXiv:2509.14242  [pdf, ps, other]

    eess.SP cs.LG

    Artificial Intelligence-derived Cardiotocography Age as a Digital Biomarker for Predicting Future Adverse Pregnancy Outcomes

    Authors: Jinshuai Gu, Zenghui Lin, Jingying Ma, Jingyu Wang, Linyan Zhang, Rui Bai, Zelin Tu, Youyou Jiang, Donglin Xie, Yuxi Zhou, Guoli Liu, Shenda Hong

    Abstract: Cardiotocography (CTG) is a low-cost, non-invasive fetal health assessment technique used globally, especially in underdeveloped countries. However, it is currently mainly used to identify the fetus's current status (e.g., fetal acidosis or hypoxia), and the potential of CTG in predicting future adverse pregnancy outcomes has not been fully explored. We aim to develop an AI-based model that predic…

    Submitted 3 September, 2025; originally announced September 2025.

  14. arXiv:2509.12474  [pdf, ps, other]

    cs.CV

    Image Tokenizer Needs Post-Training

    Authors: Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides

    Abstract: Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distribution, where current tokenizers only prioritize the reconstruction task that happens before generative training without considering the generation errors durin…

    Submitted 15 September, 2025; originally announced September 2025.

    Comments: 21 pages, 16 figures, 10 tables. arXiv admin note: substantial text overlap with arXiv:2503.08354

  15. arXiv:2509.12289  [pdf, ps, other]

    cs.LG cs.AI

    C3DE: Causal-Aware Collaborative Neural Controlled Differential Equation for Long-Term Urban Crowd Flow Prediction

    Authors: Yuting Liu, Qiang Zhou, Hanzhe Li, Chenqi Gong, Jingjing Gu

    Abstract: Long-term urban crowd flow prediction suffers significantly from cumulative sampling errors, due to increased sequence lengths and sampling intervals, which inspired us to leverage Neural Controlled Differential Equations (NCDEs) to mitigate this issue. However, regarding the crucial influence of Points of Interest (POIs) evolution on long-term crowd flow, the multi-timescale asynchronous dynamics…

    Submitted 15 September, 2025; originally announced September 2025.

  16. arXiv:2509.09713  [pdf, ps, other]

    cs.CL cs.AI

    HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering

    Authors: Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu

    Abstract: The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods s…

    Submitted 8 September, 2025; originally announced September 2025.

  17. arXiv:2509.07396  [pdf, ps, other]

    physics.optics cs.AI cs.ET

    Toward Lifelong-Sustainable Electronic-Photonic AI Systems via Extreme Efficiency, Reconfigurability, and Robustness

    Authors: Ziang Yin, Hongjian Zhou, Chetan Choppali Sudarshan, Vidya Chhabria, Jiaqi Gu

    Abstract: The relentless growth of large-scale artificial intelligence (AI) has created unprecedented demand for computational power, straining the energy, bandwidth, and scaling limits of conventional electronic platforms. Electronic-photonic integrated circuits (EPICs) have emerged as a compelling platform for next-generation AI systems, offering inherent advantages in ultra-high bandwidth, low latency, a…

    Submitted 9 September, 2025; originally announced September 2025.

    Comments: 8 pages

  18. arXiv:2509.04091  [pdf, ps, other]

    cs.CR

    Revisiting Third-Party Library Detection: A Ground Truth Dataset and Its Implications Across Security Tasks

    Authors: Jintao Gu, Haolang Lu, Guoshun Nan, Yihan Lin, Kun Wang, Yuchun Guo, Yigui Cao, Yang Liu

    Abstract: Accurate detection of third-party libraries (TPLs) is fundamental to Android security, supporting vulnerability tracking, malware detection, and supply chain auditing. Despite many proposed tools, their real-world effectiveness remains unclear. We present the first large-scale empirical study of ten state-of-the-art TPL detection techniques across over 6,000 apps, enabled by a new ground truth dat…

    Submitted 5 September, 2025; v1 submitted 4 September, 2025; originally announced September 2025.

    Comments: 20 pages, 7 figures

    MSC Class: 68M25 ACM Class: K.6.5; D.2.7

  19. arXiv:2509.01790  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

    Authors: Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin

    Abstract: Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it l…

    Submitted 1 September, 2025; originally announced September 2025.

    Comments: Accepted to EMNLP 2025 Main Conference

  20. arXiv:2509.01158  [pdf, ps, other]

    cs.CL

    Joint Information Extraction Across Classical and Modern Chinese with Tea-MOELoRA

    Authors: Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang

    Abstract: Chinese information extraction (IE) involves multiple tasks across diverse temporal domains, including Classical and Modern documents. Fine-tuning a single model on heterogeneous tasks and across different eras may lead to interference and reduced performance. Therefore, in this paper, we propose Tea-MOELoRA, a parameter-efficient multi-task framework that combines LoRA with a Mixture-of-Experts (…

    Submitted 9 September, 2025; v1 submitted 1 September, 2025; originally announced September 2025.

    Comments: 9 pages, 3 figures

  21. arXiv:2508.20404  [pdf, ps, other]

    cs.AI

    AWorld: Orchestrating the Training Recipe for Agentic AI

    Authors: Chengyue Yu, Siyuan Lu, Chenyi Zhuang, Dong Wang, Qintong Wu, Zongyue Li, Runsheng Gan, Chunfeng Wang, Siqi Hou, Gaochi Huang, Wenlong Yan, Lifeng Hong, Aohui Xue, Yanfeng Wang, Jinjie Gu, David Tsai, Tao Lin

    Abstract: The learning from practice paradigm is crucial for developing capable Agentic AI systems, yet it is severely hampered by inefficient experience generation, a bottleneck especially pronounced in complex benchmarks like GAIA. To address this, we introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates…

    Submitted 31 August, 2025; v1 submitted 28 August, 2025; originally announced August 2025.

  22. arXiv:2508.19679  [pdf, ps, other]

    cs.AI

    InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

    Authors: Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, Yuning Jiang, Bo Zheng

    Abstract: Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benc…

    Submitted 27 August, 2025; originally announced August 2025.

  23. arXiv:2508.17922  [pdf, ps, other]

    cs.RO cs.CV

    Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

    Authors: Bokai Ji, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Guangxia Li

    Abstract: Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, which is overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. According to this observation, we present a new dataset comprising fifteen th…

    Submitted 25 August, 2025; originally announced August 2025.

  24. arXiv:2508.17855  [pdf, ps, other]

    cs.CL cs.CY

    Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning

    Authors: Haijiang Liu, Qiyuan Li, Chao Gao, Yong Cao, Xiangyu Xu, Xun Wu, Daniel Hershcovich, Jinguang Gu

    Abstract: Introducing MARK, the Multi-stAge Reasoning frameworK for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulat…

    Submitted 25 August, 2025; originally announced August 2025.

    Comments: 23 pages, 6 figures, accepted to EMNLP 2025 main

  25. arXiv:2508.14880  [pdf, ps, other]

    cs.CL

    MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework

    Authors: Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Junwei Liu, Jinjie Gu

    Abstract: Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as eviden…

    Submitted 1 September, 2025; v1 submitted 20 August, 2025; originally announced August 2025.

    Comments: 13 pages, 5 figures

  26. arXiv:2508.14812  [pdf, ps, other]

    cs.CV

    Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives

    Authors: Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu, Zuxuan Wu, Yu-Gang Jiang

    Abstract: The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems…

    Submitted 20 August, 2025; originally announced August 2025.

    Comments: 11 pages, 4 figures

  27. arXiv:2508.13634  [pdf, ps, other]

    cs.AI

    V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task

    Authors: Jikai Chen, Long Chen, Dong Wang, Leilei Gan, Chenyi Zhuang, Jinjie Gu

    Abstract: Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) ignoring processing background regions causes attention drift from the desired area, and…

    Submitted 19 August, 2025; originally announced August 2025.

  28. arXiv:2508.13186  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

    Authors: Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen

    Abstract: AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challengin…

    Submitted 14 August, 2025; originally announced August 2025.

    Comments: The first two authors contribute equally, 26 pages, repo at https://github.com/MMBrowseComp/MM-BrowseComp

  29. arXiv:2508.11939  [pdf, ps, other]

    cs.CR

    Design and Implementation of a Controlled Ransomware Framework for Educational Purposes Using Flutter Cryptographic APIs on Desktop PCs and Android Devices

    Authors: James Gu, Ahmed Sartaj, Mohammed Akram Taher Khan, Rashid Hussain Khokhar

    Abstract: This study focuses on the creation and implementation of ransomware for educational purposes that leverages Python's native cryptographic APIs in a controlled environment. Additionally, an Android version of the framework is implemented using Flutter and Dart. For both versions, open-source cryptographic libraries are utilized. With this framework, researchers can systematically explore the functi…

    Submitted 16 August, 2025; originally announced August 2025.

    Comments: 6 pages, 1 figure, 1 table, 2 algorithms

  30. arXiv:2508.10881  [pdf, ps, other]

    cs.CV cs.AI

    ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

    Authors: Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan

    Abstract: Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. T…

    Submitted 14 August, 2025; originally announced August 2025.

    Comments: Project Page: https://lg-li.github.io/project/tooncomposer

  31. arXiv:2508.10259  [pdf, ps, other]

    cs.OS

    Leveraging OS-Level Primitives for Robotic Action Management

    Authors: Wenxin Zheng, Boyang Li, Bin Xu, Erhu Feng, Jinyu Gu, Haibo Chen

    Abstract: End-to-end imitation learning frameworks (e.g., VLA) are increasingly prominent in robotics, as they enable rapid task transfer by learning directly from perception to control, eliminating the need for complex hand-crafted features. However, even when employing SOTA VLA-based models, they still exhibit limited generalization capabilities and suboptimal action efficiency, due to the constraints imp…

    Submitted 13 August, 2025; originally announced August 2025.

  32. arXiv:2508.09889  [pdf, ps, other]

    cs.AI

    Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld

    Authors: Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu

    Abstract: The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, this reliance introduces new challenges, as extended contexts and noisy tool outputs can undermine system reliability. To address this, we propose a dynamic Multi-Agent System (MAS) in our AWorld framework, where an Execution Ag…

    Submitted 31 August, 2025; v1 submitted 13 August, 2025; originally announced August 2025.

  33. arXiv:2508.07995  [pdf, ps, other]

    cs.IR cs.AI

    DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval

    Authors: Meixiu Long, Duolin Sun, Dan Yang, Junjie Wang, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu, Jiahai Wang

    Abstract: Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieva…

    Submitted 25 August, 2025; v1 submitted 11 August, 2025; originally announced August 2025.

  34. arXiv:2508.07750  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment

    Authors: Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu

    Abstract: Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency a…

    Submitted 11 August, 2025; originally announced August 2025.

    Comments: 12 pages, 5 figures, 7 tables

  35. arXiv:2508.07597  [pdf, ps, other]

    cs.CV cs.AI

    ShoulderShot: Generating Over-the-Shoulder Dialogue Videos

    Authors: Yuang Zhang, Junqi Cheng, Haoyu Zhao, Jiaxi Gu, Fangyuan Zou, Zenghui Lu, Peng Shu

    Abstract: Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and ge…

    Submitted 15 August, 2025; v1 submitted 10 August, 2025; originally announced August 2025.

  36. arXiv:2508.07318  [pdf, ps, other]

    cs.CV

    RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning

    Authors: Jinjing Gu, Tianbao Qin, Yuanyuan Pu, Zhengpeng Zhao

    Abstract: Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with Graph Convolutional Network (GCN). However,…

    Submitted 10 August, 2025; originally announced August 2025.

  37. arXiv:2508.05383  [pdf, ps, other]

    cs.AI

    StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

    Authors: Xiangxiang Zhang, Jingxuan Wei, Donghong Zhong, Qi Chen, Caijun Jia, Cheng Tan, Jinming Gu, Xiaobo Qin, Zhiping Liu, Liang Hu, Tong Sun, Yuchen Wu, Zewei Sun, Chenwei Lou, Hua Zheng, Tianyang Zhan, Changbao Wang, Shuangzhi Wu, Zefa Lin, Chang Guo, Sihang Yuan, Riwei Chen, Shixiong Zhao, Yingping Zhang, Gaowei Wu, et al. (9 additional authors not shown)

    Abstract: Existing Vision-Language Models often struggle with complex, multi-question reasoning tasks where partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal…

    Submitted 7 August, 2025; originally announced August 2025.

  38. arXiv:2508.05118  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Exploring Superior Function Calls via Reinforcement Learning

    Authors: Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie Gu, Chenyi Zhuang

    Abstract: Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel r…

    Submitted 15 August, 2025; v1 submitted 7 August, 2025; originally announced August 2025.

  39. arXiv:2508.04155  [pdf, ps, other]

    cs.CR cs.LG

    Evaluating Selective Encryption Against Gradient Inversion Attacks

    Authors: Jiajun Gu, Yuhang Yao, Shuaiqi Wang, Carlee Joe-Wong

    Abstract: Gradient inversion attacks pose significant privacy threats to distributed training frameworks such as federated learning, enabling malicious parties to reconstruct sensitive local training data from gradient communications between clients and an aggregation server during the aggregation process. While traditional encryption-based defenses, such as homomorphic encryption, offer strong privacy guar…

    Submitted 6 August, 2025; originally announced August 2025.

  40. arXiv:2508.00593  [pdf, ps, other]

    cs.SE

    Can User Feedback Help Issue Detection? An Empirical Study on a One-billion-user Online Service System

    Authors: Shuyao Jiang, Jiazhen Gu, Wujie Zheng, Yangfan Zhou, Michael R. Lyu

    Abstract: Background: It has long been suggested that user feedback, typically written in natural language by end-users, can help issue detection. However, for large-scale online service systems that receive a tremendous amount of feedback, it remains a challenging task to identify severe issues from user feedback. Aims: To develop a better feedback-based issue detection approach, it is crucial first to gai…

    Submitted 1 August, 2025; originally announced August 2025.

    Comments: Accepted by the 19th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2025)

  41. arXiv:2508.00205  [pdf, ps, other]

    cs.CV

    Learning Personalised Human Internal Cognition from External Expressive Behaviours for Real Personality Recognition

    Authors: Xiangyu Kong, Hengde Zhu, Haoqin Sun, Zhihao Guo, Jiayan Gu, Xinyi Ni, Wei Zhang, Shizhe Liu, Siyang Song

    Abstract: Automatic real personality recognition (RPR) aims to evaluate human real personality traits from their expressive behaviours. However, most existing solutions generally act as external observers to infer observers' personality impressions based on target individuals' expressive behaviours, which significantly deviate from their real personalities and consistently lead to inferior recognition perfo…

    Submitted 31 July, 2025; originally announced August 2025.

    Comments: 10 pages, 4 figures

  42. arXiv:2507.23726  [pdf, ps, other]

    cs.AI cs.CL

    Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

    Authors: Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, et al. (11 additional authors not shown)

    Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training throu…

    Submitted 31 July, 2025; v1 submitted 31 July, 2025; originally announced July 2025.

  43. arXiv:2507.22927  [pdf, ps, other]

    cs.CL

    PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

    Authors: Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng, Duolin Sun, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu

    Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise…

    Submitted 23 July, 2025; originally announced July 2025.

  44. arXiv:2507.22301  [pdf, ps, other]

    cs.ET

    Toward Intelligent Electronic-Photonic Design Automation for Large-Scale Photonic Integrated Circuits: from Device Inverse Design to Physical Layout Generation

    Authors: Hongjian Zhou, Pingchuan Ma, Jiaqi Gu

    Abstract: Photonic Integrated Circuits (PICs) offer tremendous advantages in bandwidth, parallelism, and energy efficiency, making them essential for emerging applications in artificial intelligence (AI), high-performance computing (HPC), sensing, and communications. However, the design of modern PICs, which now integrate hundreds to thousands of components, remains largely manual, resulting in inefficiency…

    Submitted 29 July, 2025; originally announced July 2025.

    Comments: 10 pages. SPIE Optical Design Automation (ODA) 2025

  45. arXiv:2507.21977  [pdf, ps, other]

    cs.CV

    Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition

    Authors: Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang

    Abstract: Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network…

    Submitted 14 August, 2025; v1 submitted 29 July, 2025; originally announced July 2025.

    Comments: Accepted by ACM MM 2025

  46. arXiv:2507.21391  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

    Authors: Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen

    Abstract: We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality on analyzing text response, which is time-consuming and diffi…

    Submitted 30 July, 2025; v1 submitted 28 July, 2025; originally announced July 2025.

    Comments: Accepted at ICCV 2025. Code available at https://github.com/sjz5202/LLaVA-Reward

  47. arXiv:2507.20590  [pdf, ps, other]

    cs.CV

    Harnessing Diffusion-Yielded Score Priors for Image Restoration

    Authors: Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S. Ren, Jinjin Gu, Chao Dong

    Abstract: Deep image restoration models aim to learn a mapping from degraded image space to natural image space. However, they face several critical challenges: removing degradation, generating realistic details, and ensuring pixel-level consistency. Over time, three major classes of methods have emerged, including MSE-based, GAN-based, and diffusion-based methods. However, they fail to achieve a good balan…

    Submitted 29 July, 2025; v1 submitted 28 July, 2025; originally announced July 2025.

  48. arXiv:2507.15807  [pdf, ps, other]

    cs.CV cs.AI

    True Multimodal In-Context Learning Needs Attention to the Visual Context

    Authors: Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

    Abstract: Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifical…

    Submitted 6 August, 2025; v1 submitted 21 July, 2025; originally announced July 2025.

    Comments: Accepted to COLM 2025

  49. arXiv:2507.13706  [pdf, ps, other]

    cs.CV math.ST

    GOSPA and T-GOSPA quasi-metrics for evaluation of multi-object tracking algorithms

    Authors: Ángel F. García-Fernández, Jinhao Gu, Lennart Svensson, Yuxuan Xia, Jan Krejčí, Oliver Kost, Ondřej Straka

    Abstract: This paper introduces two quasi-metrics for performance assessment of multi-object tracking (MOT) algorithms. In particular, one quasi-metric is an extension of the generalised optimal subpattern assignment (GOSPA) metric and measures the discrepancy between sets of objects. The other quasi-metric is an extension of the trajectory GOSPA (T-GOSPA) metric and measures the discrepancy between sets of…

    Submitted 18 July, 2025; originally announced July 2025.

  50. arXiv:2507.13527  [pdf, ps, other]

    cs.CV cond-mat.mtrl-sci

    SparseC-AFM: a deep learning method for fast and accurate characterization of MoS$_2$ with C-AFM

    Authors: Levi Harris, Md Jayed Hossain, Mufan Qiu, Ruichen Zhang, Pingchuan Ma, Tianlong Chen, Jiaqi Gu, Seth Ariel Tongay, Umberto Celano

    Abstract: The increasing use of two-dimensional (2D) materials in nanoelectronics demands robust metrology techniques for electrical characterization, especially for large-scale production. While atomic force microscopy (AFM) techniques like conductive AFM (C-AFM) offer high accuracy, they suffer from slow data acquisition speeds due to the raster scanning process. To address this, we introduce SparseC-AFM,…

    Submitted 17 July, 2025; originally announced July 2025.