Search | arXiv e-print repository

SCUBA: Salesforce Computer Use Benchmark

Authors: Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu

Abstract: We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas, platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including Enterp… ▽ More We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas, platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including Enterprise Software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments with support for parallel execution and fine-grained evaluation metrics to capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings. We observed huge performance gaps in different agent design paradigms and gaps between the open-source model and the closed-source model. In the zero-shot setting, open-source model powered computer-use agents that have strong performance on related benchmarks like OSWorld only have less than 5\% success rate on SCUBA, while methods built on closed-source models can still have up to 39% task success rate. In the demonstration-augmented settings, task success rates can be improved to 50\% while simultaneously reducing time and costs by 13% and 16%, respectively. These findings highlight both the challenges of enterprise tasks automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems. △ Less

Submitted 30 September, 2025; originally announced September 2025.

arXiv:2509.25534 [pdf, ps, other]

Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning

Authors: Zhiling Ye, Yun Yue, Haowen Wang, Xudong Han, Jiadi Jiang, Cheng Wei, Lei Fan, Jiaxin Liang, Shuowen Zhang, Ji Li, Chunxiao Guo, Jian Wang, Peng Wei, Jinjie Gu

Abstract: Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Lear… ▽ More Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. Remarkably, on Qwen3-32B, training with just the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.25143 [pdf, ps, other]

TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

Authors: Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng

Abstract: Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we intro… ▽ More Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question-answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we conduct an evaluation of six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze patients' condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. We also show that multi-modal retrieval augmentation yields notably higher performance gains than no retrieval and textual retrieval alone across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded on real-world clinical practice, and it reveals LVLMs' limitations in temporal medical image reasoning, as well as highlighting the use of multi-modal retrieval augmentation as a potentially promising direction worth exploring to address this challenge. △ Less

Submitted 29 September, 2025; originally announced September 2025.

arXiv:2509.24741 [pdf, ps, other]

Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm

Authors: Xue-Feng Zhu, Tianyang Xu, Yifan Pan, Jinjie Gu, Xi Li, Jiwen Lu, Xiao-Jun Wu, Josef Kittler

Abstract: Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhanc… ▽ More Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. In specific, RDTTrack fuses thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios. △ Less

Submitted 29 September, 2025; originally announced September 2025.

arXiv:2509.24386 [pdf, ps, other]

PCICF: A Pedestrian Crossing Identification and Classification Framework

Authors: Junyi Gu, Beatriz Cabrero-Daniel, Ali Nouri, Lydia Armini, Christian Berger

Abstract: We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestri… ▽ More We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestrians, bicyclists, or e-scooter riders. To better handle such varied traffic situations, end-to-end AI, which directly compute vehicle control actions from multi-modal sensor data instead of only for perception, is on the rise. High quality data is needed for systematically training and evaluating such systems within their OOD. In this work, we propose PCICF, a framework to systematically identify and classify VRU situations to support ODD's incident analysis. We base our work on the existing synthetic dataset SMIRK, and enhance it by extending its single-pedestrian-only design into the MoreSMIRK dataset, a structured dictionary of multi-pedestrian crossing situations constructed systematically. We then use space-filling curves (SFCs) to transform multi-dimensional features of scenarios into characteristic patterns, which we match with corresponding entries in MoreSMIRK. We evaluate PCICF with the large real-world dataset PIE, which contains more than 150 manually annotated pedestrian crossing videos. We show that PCICF can successfully identify and classify complex pedestrian crossings, even when groups of pedestrians merge or split. By leveraging computationally efficient components like SFCs, PCICF has even potential to be used onboard of robotaxis for OOD detection for example. We share an open-source replication package for PCICF containing its algorithms, the complete MoreSMIRK dataset and dictionary, as well as our experiment results presented in: https://github.com/Claud1234/PCICF △ Less

Submitted 29 September, 2025; originally announced September 2025.

arXiv:2509.23764 [pdf, ps, other]

Photonics-Aware Planning-Guided Automated Electrical Routing for Large-Scale Active Photonic Integrated Circuits

Authors: Hongjian Zhou, Haoyu Yang, Nicholas Gangi, Bowen Liu, Meng Zhang, Haoxing Ren, Xu Wang, Rena Huang, Jiaqi Gu

Abstract: The rising demand for AI training and inference, as well as scientific computing, combined with stringent latency and energy budgets, is driving the adoption of integrated photonics for computing, sensing, and communications. As active photonic integrated circuits (PICs) scale in device count and functional heterogeneity, physical implementation by manual scripting and ad-hoc edits is no longer te… ▽ More The rising demand for AI training and inference, as well as scientific computing, combined with stringent latency and energy budgets, is driving the adoption of integrated photonics for computing, sensing, and communications. As active photonic integrated circuits (PICs) scale in device count and functional heterogeneity, physical implementation by manual scripting and ad-hoc edits is no longer tenable. This creates an immediate need for an electronic-photonic design automation (EPDA) stack in which physical design automation is a core capability. However, there is currently no end-to-end fully automated routing flow that coordinates photonic waveguides and on-chip metal interconnect. Critically, available digital VLSI and analog/custom routers are not directly applicable to PIC metal routing due to a lack of customization to handle constraints induced by photonic devices and waveguides. We present, to our knowledge, the first end-to-end routing framework for large-scale active PICs that jointly addresses waveguides and metal wires within a unified flow. We introduce a physically-aware global planner that generates congestion- and crossing-aware routing guides while explicitly accounting for the placement of photonic components and waveguides. We further propose a sequence-consistent track assignment and a soft guidance-assisted detailed routing to speed up the routing process with significantly optimized routability and via usage. Evaluated on various large PIC designs, our router delivers fast, high-quality active PIC routing solutions with fewer vias, lower congestion, and competitive runtime relative to manual and existing VLSI router baselines; on average it reduce via count by ~99%, user-specified design rule violation by ~98%, and runtime by 17x, establishing a practical foundation for EPDA at system scale. △ Less

Submitted 28 September, 2025; originally announced September 2025.

Comments: 9 pages

arXiv:2509.21072 [pdf, ps, other]

Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Authors: Kaiwen He, Zhiwei Wang, Chenyi Zhuang, Jinjie Gu

Abstract: Recent years, multimodal models have made remarkable strides and pave the way for intelligent browser use agents. However, when solving tasks on real world webpages in multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in… ▽ More Recent years, multimodal models have made remarkable strides and pave the way for intelligent browser use agents. However, when solving tasks on real world webpages in multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting the erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, and abstracts them into a unified notion of generalized tools, either expressed as hints or as rule-based codes, and register to the tool archive in real time. The Action Team reinference the process empowered with these targeting tools, thus establishing a closed-loop training pipeline of data-tools-action-feedback. Following the 6 level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability on long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset. △ Less

Submitted 25 September, 2025; originally announced September 2025.

arXiv:2509.20358 [pdf, ps, other]

PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

Authors: Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, Lingjie Liu

Abstract: Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of phys… ▽ More Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl △ Less

Submitted 24 September, 2025; originally announced September 2025.

Comments: Accepted by NeurIPS 2025. This is the preview version; the camera-ready version is still in preparation

arXiv:2509.19244 [pdf, ps, other]

Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen

Abstract: We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text… ▽ More We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning and stratified sampling for efficient and high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation. △ Less

Submitted 24 September, 2025; v1 submitted 23 September, 2025; originally announced September 2025.

Comments: 31 pages, 15 figures

arXiv:2509.17677 [pdf, ps, other]

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

Authors: Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao

Abstract: Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation -- they need to deal with uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to ev… ▽ More Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation -- they need to deal with uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experiment results reveal a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on the high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/EngiBench/EngiBench. △ Less

Submitted 22 September, 2025; originally announced September 2025.

arXiv:2509.16494 [pdf, ps, other]

Can an Individual Manipulate the Collective Decisions of Multi-Agents?

Authors: Fengyuan Liu, Rui Zhao, Shuo Chen, Guohao Li, Philip Torr, Lei Han, Jindong Gu

Abstract: Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system,… ▽ More Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system's collaborative decision-making process. More specifically, M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.14558 [pdf, ps, other]

LLM Jailbreak Detection for (Almost) Free!

Authors: Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu

Abstract: Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks capable of producing inappropriate content. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. However, existing methods entail significant computational costs. In this paper, we firs… ▽ More Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks capable of producing inappropriate content. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. However, existing methods entail significant computational costs. In this paper, we first present a finding that the difference in output distributions between jailbreak and benign prompts can be employed for detecting jailbreak prompts. Based on this finding, we propose a Free Jailbreak Detection (FJD) which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts through the confidence of the first token. Furthermore, we enhance the detection performance of FJD through the integration of virtual instruction learning. Extensive experiments on aligned LLMs show that our FJD can effectively detect jailbreak prompts with almost no additional computational costs during LLM inference. △ Less

Submitted 17 September, 2025; originally announced September 2025.

arXiv:2509.14242 [pdf, ps, other]

Artificial Intelligence-derived Cardiotocography Age as a Digital Biomarker for Predicting Future Adverse Pregnancy Outcomes

Authors: Jinshuai Gu, Zenghui Lin, Jingying Ma, Jingyu Wang, Linyan Zhang, Rui Bai, Zelin Tu, Youyou Jiang, Donglin Xie, Yuxi Zhou, Guoli Liu, Shenda Hong

Abstract: Cardiotocography (CTG) is a low-cost, non-invasive fetal health assessment technique used globally, especially in underdeveloped countries. However, it is currently mainly used to identify the fetus's current status (e.g., fetal acidosis or hypoxia), and the potential of CTG in predicting future adverse pregnancy outcomes has not been fully explored. We aim to develop an AI-based model that predic… ▽ More Cardiotocography (CTG) is a low-cost, non-invasive fetal health assessment technique used globally, especially in underdeveloped countries. However, it is currently mainly used to identify the fetus's current status (e.g., fetal acidosis or hypoxia), and the potential of CTG in predicting future adverse pregnancy outcomes has not been fully explored. We aim to develop an AI-based model that predicts biological age from CTG time series (named CTGage), then calculate the age gap between CTGage and actual age (named CTGage-gap), and use this gap as a new digital biomarker for future adverse pregnancy outcomes. The CTGage model is developed using 61,140 records from 11,385 pregnant women, collected at Peking University People's Hospital between 2018 and 2022. For model training, a structurally designed 1D convolutional neural network is used, incorporating distribution-aligned augmented regression technology. The CTGage-gap is categorized into five groups: < -21 days (underestimation group), -21 to -7 days, -7 to 7 days (normal group), 7 to 21 days, and > 21 days (overestimation group). We further defined the underestimation group and overestimation group together as the high-risk group. We then compare the incidence of adverse outcomes and maternal diseases across these groups. The average absolute error of the CTGage model is 10.91 days. When comparing the overestimation group with the normal group, premature infants incidence is 5.33% vs. 1.42% (p < 0.05) and gestational diabetes mellitus (GDM) incidence is 31.93% vs. 20.86% (p < 0.05). When comparing the underestimation group with the normal group, low birth weight incidence is 0.17% vs. 0.15% (p < 0.05) and anaemia incidence is 37.51% vs. 34.74% (p < 0.05). Artificial intelligence-derived CTGage can predict the future risk of adverse pregnancy outcomes and hold potential as a novel, non-invasive, and easily accessible digital biomarker. △ Less

Submitted 3 September, 2025; originally announced September 2025.

arXiv:2509.12474 [pdf, ps, other]

Image Tokenizer Needs Post-Training

Authors: Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides

Abstract: Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distribution, where current tokenizers only prioritize the reconstruction task that happens before generative training without considering the generation errors durin… ▽ More Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distribution, where current tokenizers only prioritize the reconstruction task that happens before generative training without considering the generation errors during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space, and, from which, we propose a novel tokenizer training scheme including both main-training and post-training, focusing on improving latent space construction and decoding respectively. During the main training, a latent perturbation strategy is proposed to simulate sampling noises, \ie, the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of tokenizer, thus boosting the generation quality and convergence speed, and a novel tokenizer evaluation metric, \ie, pFID, which successfully correlates the tokenizer performance to generation quality. During post-training, we further optimize the tokenizer decoder regarding a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a $\sim$400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments are conducted to broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators. △ Less

Submitted 15 September, 2025; originally announced September 2025.

Comments: 21 pages, 16 figures, 10 tables. arXiv admin note: substantial text overlap with arXiv:2503.08354

arXiv:2509.12289 [pdf, ps, other]

C3DE: Causal-Aware Collaborative Neural Controlled Differential Equation for Long-Term Urban Crowd Flow Prediction

Authors: Yuting Liu, Qiang Zhou, Hanzhe Li, Chenqi Gong, Jingjing Gu

Abstract: Long-term urban crowd flow prediction suffers significantly from cumulative sampling errors, due to increased sequence lengths and sampling intervals, which inspired us to leverage Neural Controlled Differential Equations (NCDEs) to mitigate this issue. However, regarding the crucial influence of Points of Interest (POIs) evolution on long-term crowd flow, the multi-timescale asynchronous dynamics… ▽ More Long-term urban crowd flow prediction suffers significantly from cumulative sampling errors, due to increased sequence lengths and sampling intervals, which inspired us to leverage Neural Controlled Differential Equations (NCDEs) to mitigate this issue. However, regarding the crucial influence of Points of Interest (POIs) evolution on long-term crowd flow, the multi-timescale asynchronous dynamics between crowd flow and POI distribution, coupled with latent spurious causality, poses challenges to applying NCDEs for long-term urban crowd flow prediction. To this end, we propose Causal-aware Collaborative neural CDE (C3DE) to model the long-term dynamic of crowd flow. Specifically, we introduce a dual-path NCDE as the backbone to effectively capture the asynchronous evolution of collaborative signals across multiple time scales. Then, we design a dynamic correction mechanism with the counterfactual-based causal effect estimator to quantify the causal impact of POIs on crowd flow and minimize the accumulation of spurious correlations. Finally, we leverage a predictor for long-term prediction with the fused collaborative signals of POI and crowd flow. Extensive experiments on three real-world datasets demonstrate the superior performance of C3DE, particularly in cities with notable flow fluctuations. △ Less

Submitted 15 September, 2025; originally announced September 2025.

arXiv:2509.09713 [pdf, ps, other]

HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering

Authors: Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu

Abstract: The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods s… ▽ More The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system's adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks. △ Less

Submitted 8 September, 2025; originally announced September 2025.

arXiv:2509.07396 [pdf, ps, other]

Toward Lifelong-Sustainable Electronic-Photonic AI Systems via Extreme Efficiency, Reconfigurability, and Robustness

Authors: Ziang Yin, Hongjian Zhou, Chetan Choppali Sudarshan, Vidya Chhabria, Jiaqi Gu

Abstract: The relentless growth of large-scale artificial intelligence (AI) has created unprecedented demand for computational power, straining the energy, bandwidth, and scaling limits of conventional electronic platforms. Electronic-photonic integrated circuits (EPICs) have emerged as a compelling platform for next-generation AI systems, offering inherent advantages in ultra-high bandwidth, low latency, a… ▽ More The relentless growth of large-scale artificial intelligence (AI) has created unprecedented demand for computational power, straining the energy, bandwidth, and scaling limits of conventional electronic platforms. Electronic-photonic integrated circuits (EPICs) have emerged as a compelling platform for next-generation AI systems, offering inherent advantages in ultra-high bandwidth, low latency, and energy efficiency for computing and interconnection. Beyond performance, EPICs also hold unique promises for sustainability. Fabricated in relaxed process nodes with fewer metal layers and lower defect densities, photonic devices naturally reduce embodied carbon footprint (CFP) compared to advanced digital electronic integrated circuits, while delivering orders-of-magnitude higher computing performance and interconnect bandwidth. To further advance the sustainability of photonic AI systems, we explore how electronic-photonic design automation (EPDA) and cross-layer co-design methodologies can amplify these inherent benefits. We present how advanced EPDA tools enable more compact layout generation, reducing both chip area and metal layer usage. We will also demonstrate how cross-layer device-circuit-architecture co-design unlocks new sustainability gains for photonic hardware: ultra-compact photonic circuit designs that minimize chip area cost, reconfigurable hardware topology that adapts to evolving AI workloads, and intelligent resilience mechanisms that prolong lifetime by tolerating variations and faults. By uniting intrinsic photonic efficiency with EPDA- and co-design-driven gains in area efficiency, reconfigurability, and robustness, we outline a vision for lifelong-sustainable electronic-photonic AI systems. This perspective highlights how EPIC AI systems can simultaneously meet the performance demands of modern AI and the urgent imperative for sustainable computing. △ Less

Submitted 9 September, 2025; originally announced September 2025.

Comments: 8 pages

arXiv:2509.04091 [pdf, ps, other]

Revisiting Third-Party Library Detection: A Ground Truth Dataset and Its Implications Across Security Tasks

Authors: Jintao Gu, Haolang Lu, Guoshun Nan, Yihan Lin, Kun Wang, Yuchun Guo, Yigui Cao, Yang Liu

Abstract: Accurate detection of third-party libraries (TPLs) is fundamental to Android security, supporting vulnerability tracking, malware detection, and supply chain auditing. Despite many proposed tools, their real-world effectiveness remains unclear. We present the first large-scale empirical study of ten state-of-the-art TPL detection techniques across over 6,000 apps, enabled by a new ground truth dat… ▽ More Accurate detection of third-party libraries (TPLs) is fundamental to Android security, supporting vulnerability tracking, malware detection, and supply chain auditing. Despite many proposed tools, their real-world effectiveness remains unclear. We present the first large-scale empirical study of ten state-of-the-art TPL detection techniques across over 6,000 apps, enabled by a new ground truth dataset with precise version-level annotations for both remote and local dependencies. Our evaluation exposes tool fragility to R8-era transformations, weak version discrimination, inaccurate correspondence of candidate libraries, difficulty in generalizing similarity thresholds, and prohibitive runtime/memory overheads at scale. Beyond tool assessment, we further analyze how TPLs shape downstream tasks, including vulnerability analysis, malware detection, secret leakage assessment, and LLM-based evaluation. From this perspective, our study provides concrete insights into how TPL characteristics affect these tasks and informs future improvements in security analysis. △ Less

Submitted 5 September, 2025; v1 submitted 4 September, 2025; originally announced September 2025.

Comments: 20pages, 7figures

MSC Class: 68M25 ACM Class: K.6.5; D.2.7

arXiv:2509.01790 [pdf, ps, other]

Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Authors: Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin

Abstract: Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it l… ▽ More Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models. △ Less

Submitted 1 September, 2025; originally announced September 2025.

Comments: Accepted to EMNLP 2025 Main Conference

arXiv:2509.01158 [pdf, ps, other]

Joint Information Extraction Across Classical and Modern Chinese with Tea-MOELoRA

Authors: Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang

Abstract: Chinese information extraction (IE) involves multiple tasks across diverse temporal domains, including Classical and Modern documents. Fine-tuning a single model on heterogeneous tasks and across different eras may lead to interference and reduced performance. Therefore, in this paper, we propose Tea-MOELoRA, a parameter-efficient multi-task framework that combines LoRA with a Mixture-of-Experts (… ▽ More Chinese information extraction (IE) involves multiple tasks across diverse temporal domains, including Classical and Modern documents. Fine-tuning a single model on heterogeneous tasks and across different eras may lead to interference and reduced performance. Therefore, in this paper, we propose Tea-MOELoRA, a parameter-efficient multi-task framework that combines LoRA with a Mixture-of-Experts (MoE) design. Multiple low-rank LoRA experts specialize in different IE tasks and eras, while a task-era-aware router mechanism dynamically allocates expert contributions. Experiments show that Tea-MOELoRA outperforms both single-task and joint LoRA baselines, demonstrating its ability to leverage task and temporal knowledge effectively. △ Less

Submitted 9 September, 2025; v1 submitted 1 September, 2025; originally announced September 2025.

Comments: 9 pages, 3 figures

arXiv:2508.20404 [pdf, ps, other]

AWorld: Orchestrating the Training Recipe for Agentic AI

Authors: Chengyue Yu, Siyuan Lu, Chenyi Zhuang, Dong Wang, Qintong Wu, Zongyue Li, Runsheng Gan, Chunfeng Wang, Siqi Hou, Gaochi Huang, Wenlong Yan, Lifeng Hong, Aohui Xue, Yanfeng Wang, Jinjie Gu, David Tsai, Tao Lin

Abstract: The learning from practice paradigm is crucial for developing capable Agentic AI systems, yet it is severely hampered by inefficient experience generation, a bottleneck especially pronounced in complex benchmarks like GAIA. To address this, we introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates… ▽ More The learning from practice paradigm is crucial for developing capable Agentic AI systems, yet it is severely hampered by inefficient experience generation, a bottleneck especially pronounced in complex benchmarks like GAIA. To address this, we introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution. This critical speedup makes extensive reinforcement learning practical and scalable. Leveraging this capability, we trained a Qwen3-32B-based agent that achieves pass@1 accuracy of 32.23% on the GAIA test set, which surpasses GPT-4o (27.91%) and rivals DeepSeek-V3 (31.89%). Our open-source system and the resulting agent provide a practical blueprint for a complete agentic AI training pipeline, from efficient interaction to demonstrable model improvement. △ Less

Submitted 31 August, 2025; v1 submitted 28 August, 2025; originally announced August 2025.

arXiv:2508.19679 [pdf, ps, other]

InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

Authors: Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, Yuning Jiang, Bo Zheng

Abstract: Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce \textbf{InquireBench}, a comprehensive benc… ▽ More Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce \textbf{InquireBench}, a comprehensive benchmark specifically designed to evaluate mobile agents' capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose \textbf{InquireMobile}, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves an 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. We will open-source all datasets, models, and evaluation codes to facilitate development in both academia and industry. △ Less

Submitted 27 August, 2025; originally announced August 2025.

arXiv:2508.17922 [pdf, ps, other]

Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

Authors: Bokai Ji, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Guangxia Li

Abstract: Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, which is overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. According to this observation, we present a new dataset comprising fifteen th… ▽ More Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, which is overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. According to this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a ``search against verifiers'' pipeline. An LMM is asked to progressively predict affordances, with the output at each step being verified by itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities, but also achieves outstanding performance broadly. △ Less

Submitted 25 August, 2025; originally announced August 2025.

arXiv:2508.17855 [pdf, ps, other]

Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning

Authors: Haijiang Liu, Qiyuan Li, Chao Gao, Yong Cao, Xiangyu Xu, Xun Wu, Daniel Hershcovich, Jinguang Gu

Abstract: Introducing MARK, the Multi-stAge Reasoning frameworK for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulat… ▽ More Introducing MARK, the Multi-stAge Reasoning frameworK for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulation: life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions. △ Less

Submitted 25 August, 2025; originally announced August 2025.

Comments: 23 pages, 6 figures, accepted to EMNLP 2025 main

arXiv:2508.14880 [pdf, ps, other]

MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework

Authors: Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Junwei Liu, Jinjie Gu

Abstract: Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as eviden… ▽ More Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains. △ Less

Submitted 1 September, 2025; v1 submitted 20 August, 2025; originally announced August 2025.

Comments: 13 pages, 5 figures

arXiv:2508.14812 [pdf, ps, other]

Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives

Authors: Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu, Zuxuan Wu, Yu-Gang Jiang

Abstract: The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems… ▽ More The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as "Repetition", can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset. △ Less

Submitted 20 August, 2025; originally announced August 2025.

Comments: 11 pages, 4 figures

arXiv:2508.13634 [pdf, ps, other]

V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task

Authors: Jikai Chen, Long Chen, Dong Wang, Leilei Gan, Chenyi Zhuang, Jinjie Gu

Abstract: Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) ignoring processing background regions causes attention drift from the desired area, and… ▽ More Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) ignoring processing background regions causes attention drift from the desired area, and (2) uniform labeling fails to distinguish between center and edges of the target UI element, leading to click imprecision. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model's focus on irrelevant regions to highlight the intended region. For the issue of center-edge distinction, V2P applies a Fitts' Law-inspired approach by modeling GUI interactions as 2D Gaussian heatmaps where the weight gradually decreases from the center towards the edges. The weight distribution follows a Gaussian function, with the variance determined by the target's size. Consequently, V2P effectively isolates the target area and teaches the model to concentrate on the most essential point of the UI element. The model trained by V2P achieves the performance with 92.3% and 50.5% on two benchmarks ScreenSpot-v2 and ScreenSpot-Pro. Ablations further confirm each component's contribution, highlighting V2P's generalizability for precise GUI grounding tasks. △ Less

Submitted 19 August, 2025; originally announced August 2025.

arXiv:2508.13186 [pdf, ps, other]

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Authors: Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen

Abstract: AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challengin… ▽ More AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02\% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models. △ Less

Submitted 14 August, 2025; originally announced August 2025.

Comments: The first two authors contribute equally, 26 pages, repo at https://github.com/MMBrowseComp/MM-BrowseComp

arXiv:2508.11939 [pdf, ps, other]

Design and Implementation of a Controlled Ransomware Framework for Educational Purposes Using Flutter Cryptographic APIs on Desktop PCs and Android Devices

Authors: James Gu, Ahmed Sartaj, Mohammed Akram Taher Khan, Rashid Hussain Khokhar

Abstract: This study focuses on the creation and implementation of ransomware for educational purposes that leverages Python's native cryptographic APIs in a controlled environment. Additionally, an Android version of the framework is implemented using Flutter and Dart. For both versions, open-source cryptographic libraries are utilized. With this framework, researchers can systematically explore the functi… ▽ More This study focuses on the creation and implementation of ransomware for educational purposes that leverages Python's native cryptographic APIs in a controlled environment. Additionally, an Android version of the framework is implemented using Flutter and Dart. For both versions, open-source cryptographic libraries are utilized. With this framework, researchers can systematically explore the functionalities of ransomware, including file encryption processes, cryptographic key management, and victim interaction dynamics. To ensure safe experimentation, multiple safeguards are incorporated, such as the ability to restrict the encryption process to a specific directory, providing the RSA private key for immediate decryption, and narrowing the scope of targetable files to a carefully curated list (.txt, .jpg, .csv, .doc). This paper draws inspiration from the infamous WannaCry ransomware and aims to simulate its behaviour on Android devices. By making the codebase open-source, it enables users to study, modify, and extend the program for pedagogical purposes and offers a hands-on tool that can be used to train the next generation of cybersecurity professionals. △ Less

Submitted 16 August, 2025; originally announced August 2025.

Comments: 6 pages, 1 figure, 1 table, 2 algorithms

arXiv:2508.10881 [pdf, ps, other]

ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

Authors: Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan

Abstract: Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. T… ▽ More Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production. △ Less

Submitted 14 August, 2025; originally announced August 2025.

Comments: Project Page: https://lg-li.github.io/project/tooncomposer

arXiv:2508.10259 [pdf, ps, other]

Leveraging OS-Level Primitives for Robotic Action Management

Authors: Wenxin Zheng, Boyang Li, Bin Xu, Erhu Feng, Jinyu Gu, Haibo Chen

Abstract: End-to-end imitation learning frameworks (e.g., VLA) are increasingly prominent in robotics, as they enable rapid task transfer by learning directly from perception to control, eliminating the need for complex hand-crafted features. However, even when employing SOTA VLA-based models, they still exhibit limited generalization capabilities and suboptimal action efficiency, due to the constraints imp… ▽ More End-to-end imitation learning frameworks (e.g., VLA) are increasingly prominent in robotics, as they enable rapid task transfer by learning directly from perception to control, eliminating the need for complex hand-crafted features. However, even when employing SOTA VLA-based models, they still exhibit limited generalization capabilities and suboptimal action efficiency, due to the constraints imposed by insufficient robotic training datasets. In addition to addressing this problem using model-based approaches, we observe that robotic action slices, which consist of contiguous action steps, exhibit strong analogies to the time slices of threads in traditional operating systems. This insight presents a novel opportunity to tackle the problem at the system level. In this paper, we propose AMS, a robot action management system enhanced with OS-level primitives like exception, context switch and record-and-replay, that improves both execution efficiency and success rates of robotic tasks. AMS first introduces action exception, which facilitates the immediate interruption of robotic actions to prevent error propagation. Secondly, AMS proposes action context, which eliminates redundant computations for VLA-based models, thereby accelerating execution efficiency in robotic actions. Finally, AMS leverages action replay to facilitate repetitive or similar robotic tasks without the need for re-training efforts. We implement AMS in both an emulated environment and on a real robot platform. The evaluation results demonstrate that AMS significantly enhances the model's generalization ability and action efficiency, achieving task success rate improvements ranging from 7x to 24x and saving end-to-end execution time ranging from 29% to 74% compared to existing robotic system without AMS support. △ Less

Submitted 13 August, 2025; originally announced August 2025.

arXiv:2508.09889 [pdf, ps, other]

Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld

Authors: Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu

Abstract: The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, this reliance introduces new challenges, as extended contexts and noisy tool outputs can undermine system reliability. To address this, we propose a dynamic Multi-Agent System (MAS) in our AWorld framework, where an Execution Ag… ▽ More The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, this reliance introduces new challenges, as extended contexts and noisy tool outputs can undermine system reliability. To address this, we propose a dynamic Multi-Agent System (MAS) in our AWorld framework, where an Execution Agent is supervised by a Guard Agent that provides on-demand dynamic maneuvering, verifying and correcting the reasoning process to improve robustness over single-agent systems. To move beyond this generic supervision, we enhance the architecture with a methodology inspired by System Identification from control theory. This method first profiles the Execution Agent offline on a benchmark dataset to create a "performance fingerprint" of its unique weaknesses. The Guard Agent then leverages this fingerprint online to deliver profile-aware supervision, making targeted interventions based on known failure patterns rather than merely reacting to immediate logical flaws. Extensive experiments on the GAIA dataset demonstrate that this profile-aware MAS significantly improves both effectiveness and stability, outperforming not only single-agent systems but also its naive counterpart. This superior performance led our system to achieve first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight that building truly trustworthy intelligent systems requires not just collaboration, but a deep, empirically-grounded understanding of each agent's unique capabilities and limitations. △ Less

Submitted 31 August, 2025; v1 submitted 13 August, 2025; originally announced August 2025.

arXiv:2508.07995 [pdf, ps, other]

DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval

Authors: Meixiu Long, Duolin Sun, Dan Yang, Junjie Wang, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu, Jiahai Wang

Abstract: Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieva… ▽ More Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieval pipeline designed for reasoning-intensive information retrieval. It consists of four components. The document preprocessing stage enhances readability and preserves content by cleaning noisy texts and segmenting long documents. The query expansion stage leverages large language models to iteratively refine user queries with explicit reasoning and evidence from retrieved documents. The retrieval stage employs a model fine-tuned on synthetic data spanning medical and mathematical domains, along with hard negatives, enabling effective handling of reasoning-intensive queries. Finally, the reranking stage combines pointwise and listwise strategies to produce both fine-grained and globally consistent rankings. On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 45.8 overall and 28.9 on original queries, consistently outperforming competitive reasoning-aware models. These results demonstrate the effectiveness of reasoning-aware retrieval strategies in complex real-world tasks. △ Less

Submitted 25 August, 2025; v1 submitted 11 August, 2025; originally announced August 2025.

arXiv:2508.07750 [pdf, ps, other]

Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment

Authors: Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu

Abstract: Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL(reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency a… ▽ More Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL(reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO's superior performance, achieving 57.70\%,17.65\% 7.95\% and 5.18\% relative improvements over SFT, DPO, PPO and GRPO baselines respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models. △ Less

Submitted 11 August, 2025; originally announced August 2025.

Comments: 12 pages, 5 figures, 7 tables

arXiv:2508.07597 [pdf, ps, other]

ShoulderShot: Generating Over-the-Shoulder Dialogue Videos

Authors: Yuang Zhang, Junqi Cheng, Haoyu Zhao, Jiaxi Gu, Fangyuan Zou, Zenghui Lu, Peng Shu

Abstract: Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and ge… ▽ More Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at https://shouldershot.github.io. △ Less

Submitted 15 August, 2025; v1 submitted 10 August, 2025; originally announced August 2025.

arXiv:2508.07318 [pdf, ps, other]

RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning

Authors: Jinjing Gu, Tianbao Qin, Yuanyuan Pu, Zhengpeng Zhao

Abstract: Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with Graph Convolutional Network (GCN). However,… ▽ More Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with Graph Convolutional Network (GCN). However, these models suffer from redundant detection information, difficulty in GCN construction, and high training costs. To address these issues, a Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed, inspired by the fact that image-text retrieval can provide rich semantic information for input images. RORPCap employs an Objects and relations Extraction Model to extract object and relation words from the image. These words are then incorporate into predefined prompt templates and encoded as prompt embeddings. Next, a Mamba-based mapping network is designed to quickly map image embeddings extracted by CLIP to visual-text embeddings. Finally, the resulting prompt embeddings and visual-text embeddings are concatenated to form textual-enriched feature embeddings, which are fed into a GPT-2 model for caption generation. Extensive experiments conducted on the widely used MS-COCO dataset show that the RORPCap requires only 2.6 hours under cross-entropy loss training, achieving 120.5% CIDEr score and 22.0% SPICE score on the "Karpathy" test split. RORPCap achieves comparable performance metrics to detector-based and GCN-based models with the shortest training time and demonstrates its potential as an alternative for image captioning. △ Less

Submitted 10 August, 2025; originally announced August 2025.

arXiv:2508.05383 [pdf, ps, other]

StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

Authors: Xiangxiang Zhang, Jingxuan Wei, Donghong Zhong, Qi Chen, Caijun Jia, Cheng Tan, Jinming Gu, Xiaobo Qin, Zhiping Liu, Liang Hu, Tong Sun, Yuchen Wu, Zewei Sun, Chenwei Lou, Hua Zheng, Tianyang Zhan, Changbao Wang, Shuangzhi Wu, Zefa Lin, Chang Guo, Sihang Yuan, Riwei Chen, Shixiong Zhao, Yingping Zhang, Gaowei Wu , et al. (9 additional authors not shown)

Abstract: Existing Vision-Language Models often struggle with complex, multi-question reasoning tasks where partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal… ▽ More Existing Vision-Language Models often struggle with complex, multi-question reasoning tasks where partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal reasoning with Structured and Verifiable Reward Models. At its core is a model-based verifier trained to provide fine-grained, sub-question-level feedback, assessing semantic and mathematical equivalence rather than relying on rigid string matching. This allows for nuanced, partial credit scoring in previously intractable problem formats. Extensive experiments demonstrate the effectiveness of StructVRM. Our trained model, Seed-StructVRM, achieves state-of-the-art performance on six out of twelve public multimodal benchmarks and our newly curated, high-difficulty STEM-Bench. The success of StructVRM validates that training with structured, verifiable rewards is a highly effective approach for advancing the capabilities of multimodal models in complex, real-world reasoning domains. △ Less

Submitted 7 August, 2025; originally announced August 2025.

arXiv:2508.05118 [pdf, ps, other]

Exploring Superior Function Calls via Reinforcement Learning

Authors: Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie GU, Chenyi Zhuang

Abstract: Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel r… ▽ More Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance group relative policy optimization through strategic entropy based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02\% overall accuracy, outperforming standard GRPO by up to 6\% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all the code, models and dataset to benefit the community. △ Less

Submitted 15 August, 2025; v1 submitted 7 August, 2025; originally announced August 2025.

arXiv:2508.04155 [pdf, ps, other]

Evaluating Selective Encryption Against Gradient Inversion Attacks

Authors: Jiajun Gu, Yuhang Yao, Shuaiqi Wang, Carlee Joe-Wong

Abstract: Gradient inversion attacks pose significant privacy threats to distributed training frameworks such as federated learning, enabling malicious parties to reconstruct sensitive local training data from gradient communications between clients and an aggregation server during the aggregation process. While traditional encryption-based defenses, such as homomorphic encryption, offer strong privacy guar… ▽ More Gradient inversion attacks pose significant privacy threats to distributed training frameworks such as federated learning, enabling malicious parties to reconstruct sensitive local training data from gradient communications between clients and an aggregation server during the aggregation process. While traditional encryption-based defenses, such as homomorphic encryption, offer strong privacy guarantees without compromising model utility, they often incur prohibitive computational overheads. To mitigate this, selective encryption has emerged as a promising approach, encrypting only a subset of gradient data based on the data's significance under a certain metric. However, there have been few systematic studies on how to specify this metric in practice. This paper systematically evaluates selective encryption methods with different significance metrics against state-of-the-art attacks. Our findings demonstrate the feasibility of selective encryption in reducing computational overhead while maintaining resilience against attacks. We propose a distance-based significance analysis framework that provides theoretical foundations for selecting critical gradient elements for encryption. Through extensive experiments on different model architectures (LeNet, CNN, BERT, GPT-2) and attack types, we identify gradient magnitude as a generally effective metric for protection against optimization-based gradient inversions. However, we also observe that no single selective encryption strategy is universally optimal across all attack scenarios, and we provide guidelines for choosing appropriate strategies for different model architectures and privacy requirements. △ Less

Submitted 6 August, 2025; originally announced August 2025.

arXiv:2508.00593 [pdf, ps, other]

Can User Feedback Help Issue Detection? An Empirical Study on a One-billion-user Online Service System

Authors: Shuyao Jiang, Jiazhen Gu, Wujie Zheng, Yangfan Zhou, Michael R. Lyu

Abstract: Background: It has long been suggested that user feedback, typically written in natural language by end-users, can help issue detection. However, for large-scale online service systems that receive a tremendous amount of feedback, it remains a challenging task to identify severe issues from user feedback. Aims: To develop a better feedback-based issue detection approach, it is crucial first to gai… ▽ More Background: It has long been suggested that user feedback, typically written in natural language by end-users, can help issue detection. However, for large-scale online service systems that receive a tremendous amount of feedback, it remains a challenging task to identify severe issues from user feedback. Aims: To develop a better feedback-based issue detection approach, it is crucial first to gain a comprehensive understanding of the characteristics of user feedback in real production systems. Method: In this paper, we conduct an empirical study on 50,378,766 user feedback items from six real-world services in a one-billion-user online service system. We first study what users provide in their feedback. We then examine whether certain features of feedback items can be good indicators of severe issues. Finally, we investigate whether adopting machine learning techniques to analyze user feedback is reasonable. Results: Our results show that a large proportion of user feedback provides irrelevant information about system issues. As a result, it is crucial to filter out issue-irrelevant information when processing user feedback. Moreover, we find severe issues that cannot be easily detected based solely on user feedback characteristics. Finally, we find that the distributions of the feedback topics in different time intervals are similar. This confirms that designing machine learning-based approaches is a viable direction for better analyzing user feedback. Conclusions: We consider that our findings can serve as an empirical foundation for feedback-based issue detection in large-scale service systems, which sheds light on the design and implementation of practical issue detection approaches. △ Less

Submitted 1 August, 2025; originally announced August 2025.

Comments: Accepted by the 19th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2025)

arXiv:2508.00205 [pdf, ps, other]

Learning Personalised Human Internal Cognition from External Expressive Behaviours for Real Personality Recognition

Authors: Xiangyu Kong, Hengde Zhu, Haoqin Sun, Zhihao Guo, Jiayan Gu, Xinyi Ni, Wei Zhang, Shizhe Liu, Siyang Song

Abstract: Automatic real personality recognition (RPR) aims to evaluate human real personality traits from their expressive behaviours. However, most existing solutions generally act as external observers to infer observers' personality impressions based on target individuals' expressive behaviours, which significantly deviate from their real personalities and consistently lead to inferior recognition perfo… ▽ More Automatic real personality recognition (RPR) aims to evaluate human real personality traits from their expressive behaviours. However, most existing solutions generally act as external observers to infer observers' personality impressions based on target individuals' expressive behaviours, which significantly deviate from their real personalities and consistently lead to inferior recognition performance. Inspired by the association between real personality and human internal cognition underlying the generation of expressive behaviours, we propose a novel RPR approach that efficiently simulates personalised internal cognition from easy-accessible external short audio-visual behaviours expressed by the target individual. The simulated personalised cognition, represented as a set of network weights that enforce the personalised network to reproduce the individual-specific facial reactions, is further encoded as a novel graph containing two-dimensional node and edge feature matrices, with a novel 2D Graph Neural Network (2D-GNN) proposed for inferring real personality traits from it. To simulate real personality-related cognition, an end-to-end strategy is designed to jointly train our cognition simulation, 2D graph construction, and personality recognition modules. △ Less

Submitted 31 July, 2025; originally announced August 2025.

Comments: 10 pages, 4 figures

arXiv:2507.23726 [pdf, ps, other]

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Authors: Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang , et al. (11 additional authors not shown)

Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training throu… ▽ More LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose \textbf{Seed-Prover}, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves $78.1\%$ of formalized past IMO problems, saturates MiniF2F, and achieves over 50\% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine \textbf{Seed-Geometry}, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning. △ Less

Submitted 31 July, 2025; v1 submitted 31 July, 2025; originally announced July 2025.

arXiv:2507.22927 [pdf, ps, other]

PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

Authors: Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng, Duolin Sun, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu

Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise… ▽ More Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce \textit{Placeholder-RAG-Benchmark}, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM's parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system's generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available in https://github.com/Alipay-Med/PRGB. △ Less

Submitted 23 July, 2025; originally announced July 2025.

arXiv:2507.22301 [pdf, ps, other]

Toward Intelligent Electronic-Photonic Design Automation for Large-Scale Photonic Integrated Circuits: from Device Inverse Design to Physical Layout Generation

Authors: Hongjian Zhou, Pingchuan Ma, Jiaqi Gu

Abstract: Photonic Integrated Circuits (PICs) offer tremendous advantages in bandwidth, parallelism, and energy efficiency, making them essential for emerging applications in artificial intelligence (AI), high-performance computing (HPC), sensing, and communications. However, the design of modern PICs, which now integrate hundreds to thousands of components, remains largely manual, resulting in inefficiency… ▽ More Photonic Integrated Circuits (PICs) offer tremendous advantages in bandwidth, parallelism, and energy efficiency, making them essential for emerging applications in artificial intelligence (AI), high-performance computing (HPC), sensing, and communications. However, the design of modern PICs, which now integrate hundreds to thousands of components, remains largely manual, resulting in inefficiency, poor scalability, and susceptibility to errors. To address these challenges, we propose PoLaRIS, a comprehensive Intelligent Electronic-Photonic Design Automation (EPDA) framework that spans both device-level synthesis and system-level physical layout. PoLaRIS combines a robust, fabrication-aware inverse design engine with a routing-informed placement and curvy-aware detailed router, enabling the automated generation of design rule violation (DRV)-free and performance-optimized layouts. By unifying physics-driven optimization with machine learning and domain-specific algorithms, PoLaRIS significantly accelerates PIC development, lowers design barriers, and lays the groundwork for scalable photonic system design automation. △ Less

Submitted 29 July, 2025; originally announced July 2025.

Comments: 10 pages. SPIE Optical Design Automation (ODA) 2025

arXiv:2507.21977 [pdf, ps, other]

Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition

Authors: Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang

Abstract: Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network… ▽ More Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN. △ Less

Submitted 14 August, 2025; v1 submitted 29 July, 2025; originally announced July 2025.

Comments: Accepted by ACM MM 2025

arXiv:2507.21391 [pdf, ps, other]

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Authors: Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen

Abstract: We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality on analyzing text response, which is time-consuming and diffi… ▽ More We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality on analyzing text response, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations. △ Less

Submitted 30 July, 2025; v1 submitted 28 July, 2025; originally announced July 2025.

Comments: Accepted at ICCV 2025. Code available at https://github.com/sjz5202/LLaVA-Reward

arXiv:2507.20590 [pdf, ps, other]

Harnessing Diffusion-Yielded Score Priors for Image Restoration

Authors: Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S. Ren, Jinjin Gu, Chao Dong

Abstract: Deep image restoration models aim to learn a mapping from degraded image space to natural image space. However, they face several critical challenges: removing degradation, generating realistic details, and ensuring pixel-level consistency. Over time, three major classes of methods have emerged, including MSE-based, GAN-based, and diffusion-based methods. However, they fail to achieve a good balan… ▽ More Deep image restoration models aim to learn a mapping from degraded image space to natural image space. However, they face several critical challenges: removing degradation, generating realistic details, and ensuring pixel-level consistency. Over time, three major classes of methods have emerged, including MSE-based, GAN-based, and diffusion-based methods. However, they fail to achieve a good balance between restoration quality, fidelity, and speed. We propose a novel method, HYPIR, to address these challenges. Our solution pipeline is straightforward: it involves initializing the image restoration model with a pre-trained diffusion model and then fine-tuning it with adversarial training. This approach does not rely on diffusion loss, iterative sampling, or additional adapters. We theoretically demonstrate that initializing adversarial training from a pre-trained diffusion model positions the initial restoration model very close to the natural image distribution. Consequently, this initialization improves numerical stability, avoids mode collapse, and substantially accelerates the convergence of adversarial training. Moreover, HYPIR inherits the capabilities of diffusion models with rich user control, enabling text-guided restoration and adjustable texture richness. Requiring only a single forward pass, it achieves faster convergence and inference speed than diffusion-based methods. Extensive experiments show that HYPIR outperforms previous state-of-the-art methods, achieving efficient and high-quality image restoration. △ Less

Submitted 29 July, 2025; v1 submitted 28 July, 2025; originally announced July 2025.

arXiv:2507.15807 [pdf, ps, other]

True Multimodal In-Context Learning Needs Attention to the Visual Context

Authors: Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

Abstract: Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifical… ▽ More Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information-particularly visual content-for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm . △ Less

Submitted 6 August, 2025; v1 submitted 21 July, 2025; originally announced July 2025.

Comments: Accepted to COLM 2025

arXiv:2507.13706 [pdf, ps, other]

GOSPA and T-GOSPA quasi-metrics for evaluation of multi-object tracking algorithms

Authors: Ángel F. García-Fernández, Jinhao Gu, Lennart Svensson, Yuxuan Xia, Jan Krejčí, Oliver Kost, Ondřej Straka

Abstract: This paper introduces two quasi-metrics for performance assessment of multi-object tracking (MOT) algorithms. In particular, one quasi-metric is an extension of the generalised optimal subpattern assignment (GOSPA) metric and measures the discrepancy between sets of objects. The other quasi-metric is an extension of the trajectory GOSPA (T-GOSPA) metric and measures the discrepancy between sets of… ▽ More This paper introduces two quasi-metrics for performance assessment of multi-object tracking (MOT) algorithms. In particular, one quasi-metric is an extension of the generalised optimal subpattern assignment (GOSPA) metric and measures the discrepancy between sets of objects. The other quasi-metric is an extension of the trajectory GOSPA (T-GOSPA) metric and measures the discrepancy between sets of trajectories. Similar to the GOSPA-based metrics, these quasi-metrics include costs for localisation error for properly detected objects, the number of false objects and the number of missed objects. The T-GOSPA quasi-metric also includes a track switching cost. Differently from the GOSPA and T-GOSPA metrics, the proposed quasi-metrics have the flexibility of penalising missed and false objects with different costs, and the localisation costs are not required to be symmetric. These properties can be useful in MOT evaluation in certain applications. The performance of several Bayesian MOT algorithms is assessed with the T-GOSPA quasi-metric via simulations. △ Less

Submitted 18 July, 2025; originally announced July 2025.

arXiv:2507.13527 [pdf, ps, other]

SparseC-AFM: a deep learning method for fast and accurate characterization of MoS$_2$ with C-AFM

Authors: Levi Harris, Md Jayed Hossain, Mufan Qiu, Ruichen Zhang, Pingchuan Ma, Tianlong Chen, Jiaqi Gu, Seth Ariel Tongay, Umberto Celano

Abstract: The increasing use of two-dimensional (2D) materials in nanoelectronics demands robust metrology techniques for electrical characterization, especially for large-scale production. While atomic force microscopy (AFM) techniques like conductive AFM (C-AFM) offer high accuracy, they suffer from slow data acquisition speeds due to the raster scanning process. To address this, we introduce SparseC-AFM,… ▽ More The increasing use of two-dimensional (2D) materials in nanoelectronics demands robust metrology techniques for electrical characterization, especially for large-scale production. While atomic force microscopy (AFM) techniques like conductive AFM (C-AFM) offer high accuracy, they suffer from slow data acquisition speeds due to the raster scanning process. To address this, we introduce SparseC-AFM, a deep learning model that rapidly and accurately reconstructs conductivity maps of 2D materials like MoS$_2$ from sparse C-AFM scans. Our approach is robust across various scanning modes, substrates, and experimental conditions. We report a comparison between (a) classic flow implementation, where a high pixel density C-AFM image (e.g., 15 minutes to collect) is manually parsed to extract relevant material parameters, and (b) our SparseC-AFM method, which achieves the same operation using data that requires substantially less acquisition time (e.g., under 5 minutes). SparseC-AFM enables efficient extraction of critical material parameters in MoS$_2$, including film coverage, defect density, and identification of crystalline island boundaries, edges, and cracks. We achieve over 11x reduction in acquisition time compared to manual extraction from a full-resolution C-AFM image. Moreover, we demonstrate that our model-predicted samples exhibit remarkably similar electrical properties to full-resolution data gathered using classic-flow scanning. This work represents a significant step toward translating AI-assisted 2D material characterization from laboratory research to industrial fabrication. Code and model weights are available at github.com/UNITES-Lab/sparse-cafm. △ Less

Submitted 17 July, 2025; originally announced July 2025.

Showing 1–50 of 930 results for author: Gu, J