Search | arXiv e-print repository

Interpretable Clustering with Adaptive Heterogeneous Causal Structure Learning in Mixed Observational Data

Authors: Wenrui Li, Qinghao Zhang, Xiaowo Wang

Abstract: Understanding causal heterogeneity is essential for scientific discovery in domains such as biology and medicine. However, existing methods lack causal awareness, with insufficient modeling of heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations. We propose an unsupervised fram… ▽ More Understanding causal heterogeneity is essential for scientific discovery in domains such as biology and medicine. However, existing methods lack causal awareness, with insufficient modeling of heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations. We propose an unsupervised framework, HCL (Interpretable Causal Mechanism-Aware Clustering with Adaptive Heterogeneous Causal Structure Learning), that jointly infers latent clusters and their associated causal structures from mixed-type observational data without requiring temporal ordering, environment labels, interventions or other prior knowledge. HCL relaxes the homogeneity and sufficiency assumptions by introducing an equivalent representation that encodes both structural heterogeneity and confounding. It further develops a bi-directional iterative strategy to alternately refine causal clustering and structure learning, along with a self-supervised regularization that balance cross-cluster universality and specificity. Together, these components enable convergence toward interpretable, heterogeneous causal patterns. Theoretically, we show identifiability of heterogeneous causal structures under mild conditions. Empirically, HCL achieves superior performance in both clustering and structure learning tasks, and recovers biologically meaningful mechanisms in real-world single-cell perturbation data, demonstrating its utility for discovering interpretable, mechanism-level causal heterogeneity. △ Less

Submitted 4 September, 2025; originally announced September 2025.

arXiv:2509.04292 [pdf, ps, other]

Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Authors: Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang

Abstract: Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models Counter-intuitive Abilitytheir capacity to override training-induced biases a… ▽ More Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models Counter-intuitive Abilitytheir capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios. △ Less

Submitted 4 September, 2025; originally announced September 2025.

arXiv:2509.04093 [pdf, ps, other]

Open-Source Full-Duplex Conversational Datasets for Natural and Interactive Speech Synthesis

Authors: Zhitong Zhou, Qingqing Zhang, Lei Luo, Jiechen Liu, Ruohua Zhou

Abstract: Full-duplex, spontaneous conversational data are essential for enhancing the naturalness and interactivity of synthesized speech in conversational TTS systems. We present two open-source dual-track conversational speech datasets, one in Chinese and one in English, designed to enhance the naturalness of synthesized speech by providing more realistic conversational data. The two datasets contain a t… ▽ More Full-duplex, spontaneous conversational data are essential for enhancing the naturalness and interactivity of synthesized speech in conversational TTS systems. We present two open-source dual-track conversational speech datasets, one in Chinese and one in English, designed to enhance the naturalness of synthesized speech by providing more realistic conversational data. The two datasets contain a total of 15 hours of natural, spontaneous conversations recorded in isolated rooms, which produces separate high-quality audio tracks for each speaker. The conversations cover diverse daily topics and domains, capturing realistic interaction patterns including frequent overlaps, backchannel responses, laughter, and other non-verbal vocalizations. We introduce the data collection procedure, transcription and annotation methods. We demonstrate the utility of these corpora by fine-tuning a baseline TTS model with the proposed datasets. The fine-tuned TTS model achieves higher subjective and objective evaluation metrics compared to the baseline, indicating improved naturalness and conversational realism in synthetic speech. All data, annotations, and supporting code for fine-tuning and evaluation are made available to facilitate further research in conversational speech synthesis. △ Less

Submitted 4 September, 2025; originally announced September 2025.

arXiv:2509.03887 [pdf, ps, other]

OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction

Authors: Bu Jin, Songen Gu, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Wei Yin

Abstract: In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches bas… ▽ More In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from \textbf{inefficiency}, \textbf{temporal degradation} in long-term generation and \textbf{lack of controllability}. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a \textbf{TensFormer}, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time. △ Less

Submitted 4 September, 2025; originally announced September 2025.

arXiv:2509.03244 [pdf, ps, other]

FoMEMO: Towards Foundation Models for Expensive Multi-objective Optimization

Authors: Yiming Yao, Fei Liu, Liang Zhao, Xi Lin, Qingfu Zhang

Abstract: Expensive multi-objective optimization is a prevalent and crucial concern in many real-world scenarios, where sample-efficiency is vital due to the limited evaluations to recover the true Pareto front for decision making. Existing works either involve rebuilding Gaussian process surrogates from scratch for each objective in each new problem encountered, or rely on extensive past domain experiments… ▽ More Expensive multi-objective optimization is a prevalent and crucial concern in many real-world scenarios, where sample-efficiency is vital due to the limited evaluations to recover the true Pareto front for decision making. Existing works either involve rebuilding Gaussian process surrogates from scratch for each objective in each new problem encountered, or rely on extensive past domain experiments for pre-training deep learning models, making them hard to generalize and impractical to cope with various emerging applications in the real world. To address this issue, we propose a new paradigm named FoMEMO (Foundation Models for Expensive Multi-objective Optimization), which enables the establishment of a foundation model conditioned on any domain trajectory and user preference, and facilitates fast in-context optimization based on the predicted preference-wise aggregation posteriors. Rather than accessing extensive domain experiments in the real world, we demonstrate that pre-training the foundation model with a diverse set of hundreds of millions of synthetic data can lead to superior adaptability to unknown problems, without necessitating any subsequent model training or updates in the optimization process. We evaluate our method across a variety of synthetic benchmarks and real-word applications, and demonstrate its superior generality and competitive performance compared to existing methods. △ Less

Submitted 3 September, 2025; originally announced September 2025.

arXiv:2509.01875 [pdf, ps, other]

RadioDiff-Loc: Diffusion Model Enhanced Scattering Congnition for NLoS Localization with Sparse Radio Map Estimation

Authors: Xiucheng Wang, Qiming Zhang, Nan Cheng

Abstract: Accurate localization of non-cooperative signal sources in non-line-of-sight (NLoS) environments remains a critical challenge with a wide range of applications, including autonomous navigation, industrial automation, and emergency response. In such settings, traditional positioning techniques relying on line-of-sight (LoS) or cooperative signaling fail due to severe multipath propagation and unkno… ▽ More Accurate localization of non-cooperative signal sources in non-line-of-sight (NLoS) environments remains a critical challenge with a wide range of applications, including autonomous navigation, industrial automation, and emergency response. In such settings, traditional positioning techniques relying on line-of-sight (LoS) or cooperative signaling fail due to severe multipath propagation and unknown transmit power. This paper proposes a novel generative inference framework for NLoS localization based on conditional diffusion models. By leveraging the physical insight that diffracted electromagnetic energy concentrates near building edges, we develop a sampling strategy that collects sparse received signal strength (RSS) measurements at the geometric vertices of obstacles--locations that maximize Fisher information and mutual information with respect to the unknown source. To overcome the lack of known transmission power, we normalize all sampled RSS values relative to the maximum observed intensity, enabling the construction of a power-invariant radio map (RM). A conditional diffusion model is trained to reconstruct the full RM based on environmental layout and sparse RSS observations. Localization is then achieved by identifying the brightest point on the generated RM. Moreover, the proposed framework is compatible with existing RSS-based localization algorithms, enabling a dual-driven paradigm that fuses physical knowledge and data-driven inference for improved accuracy. Extensive theoretical analysis and empirical validation demonstrate that our approach achieves high localization accuracy with significantly reduced sampling cost, offering a scalable and physically grounded solution for non-cooperative NLoS emitter localization. △ Less

Submitted 4 September, 2025; v1 submitted 1 September, 2025; originally announced September 2025.

arXiv:2509.01364 [pdf, ps, other]

TopoNav: Topological Graphs as a Key Enabler for Advanced Object Navigation

Authors: Peiran Liu, Qiang Zhang, Daojie Peng, Lingfeng Zhang, Yihao Qin, Hang Zhou, Jun Ma, Renjing Xu, Yiding Ji

Abstract: Object Navigation (ObjectNav) has made great progress with large language models (LLMs), but still faces challenges in memory management, especially in long-horizon tasks and dynamic scenes. To address this, we propose TopoNav, a new framework that leverages topological structures as spatial memory. By building and updating a topological graph that captures scene connections, adjacency, and semant… ▽ More Object Navigation (ObjectNav) has made great progress with large language models (LLMs), but still faces challenges in memory management, especially in long-horizon tasks and dynamic scenes. To address this, we propose TopoNav, a new framework that leverages topological structures as spatial memory. By building and updating a topological graph that captures scene connections, adjacency, and semantic meaning, TopoNav helps agents accumulate spatial knowledge over time, retrieve key information, and reason effectively toward distant goals. Our experiments show that TopoNav achieves state-of-the-art performance on benchmark ObjectNav datasets, with higher success rates and more efficient paths. It particularly excels in diverse and complex environments, as it connects temporary visual inputs with lasting spatial understanding. △ Less

Submitted 1 September, 2025; originally announced September 2025.

arXiv:2509.01106 [pdf, ps, other]

Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Authors: Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li

Abstract: We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions,… ▽ More We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with human within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering. △ Less

Submitted 31 August, 2025; originally announced September 2025.

Comments: Tech report. Project page: https://robix-seed.github.io/robix/

arXiv:2508.21444 [pdf, ps, other]

Scale-GS: Efficient Scalable Gaussian Splatting via Redundancy-filtering Training on Streaming Content

Authors: Jiayu Yang, Weijian Su, Songqian Zhang, Yuqi Han, Jinli Suo, Qiang Zhang

Abstract: 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, a key requirement for immersive applications. However, the extension of 3DGS to dynamic scenes remains limitations on the substantial data volume of dense Gaussians and the prolonged training time required for each frame. This paper presents \M, a scalable Gaussian Splatting framework designed for efficient training in streami… ▽ More 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, a key requirement for immersive applications. However, the extension of 3DGS to dynamic scenes remains limitations on the substantial data volume of dense Gaussians and the prolonged training time required for each frame. This paper presents \M, a scalable Gaussian Splatting framework designed for efficient training in streaming tasks. Specifically, Gaussian spheres are hierarchically organized by scale within an anchor-based structure. Coarser-level Gaussians represent the low-resolution structure of the scene, while finer-level Gaussians, responsible for detailed high-fidelity rendering, are selectively activated by the coarser-level Gaussians. To further reduce computational overhead, we introduce a hybrid deformation and spawning strategy that models motion of inter-frame through Gaussian deformation and triggers Gaussian spawning to characterize wide-range motion. Additionally, a bidirectional adaptive masking mechanism enhances training efficiency by removing static regions and prioritizing informative viewpoints. Extensive experiments demonstrate that \M~ achieves superior visual quality while significantly reducing training time compared to state-of-the-art methods. △ Less

Submitted 29 August, 2025; originally announced August 2025.

arXiv:2508.21044 [pdf, ps, other]

MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Authors: Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang

Abstract: Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-fr… ▽ More Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while effectively reducing 75% visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon. △ Less

Submitted 28 August, 2025; originally announced August 2025.

Comments: 10 pages, 3 figures

arXiv:2508.20594 [pdf, ps, other]

UTA-Sign: Unsupervised Thermal Video Augmentation via Event-Assisted Traffic Signage Sketching

Authors: Yuqi Han, Songqian Zhang, Weijian Su, Ke Li, Jiayu Yang, Jinli Suo, Qiang Zhang

Abstract: The thermal camera excels at perceiving outdoor environments under low-light conditions, making it ideal for applications such as nighttime autonomous driving and unmanned navigation. However, thermal cameras encounter challenges when capturing signage from objects made of similar materials, which can pose safety risks for accurately understanding semantics in autonomous driving systems. In contra… ▽ More The thermal camera excels at perceiving outdoor environments under low-light conditions, making it ideal for applications such as nighttime autonomous driving and unmanned navigation. However, thermal cameras encounter challenges when capturing signage from objects made of similar materials, which can pose safety risks for accurately understanding semantics in autonomous driving systems. In contrast, the neuromorphic vision camera, also known as an event camera, detects changes in light intensity asynchronously and has proven effective in high-speed, low-light traffic environments. Recognizing the complementary characteristics of these two modalities, this paper proposes UTA-Sign, an unsupervised thermal-event video augmentation for traffic signage in low-illumination environments, targeting elements such as license plates and roadblock indicators. To address the signage blind spots of thermal imaging and the non-uniform sampling of event cameras, we developed a dual-boosting mechanism that fuses thermal frames and event signals for consistent signage representation over time. The proposed method utilizes thermal frames to provide accurate motion cues as temporal references for aligning the uneven event signals. At the same time, event signals contribute subtle signage content to the raw thermal frames, enhancing the overall understanding of the environment. The proposed method is validated on datasets collected from real-world scenarios, demonstrating superior quality in traffic signage sketching and improved detection accuracy at the perceptual level. △ Less

Submitted 28 August, 2025; originally announced August 2025.

arXiv:2508.20373 [pdf, ps, other]

Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems

Authors: Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li, Qifan Zhang, Jia Li

Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplo… ▽ More Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Our implementation is available at https://github.com/Graph-Reasoner/Graph-R1, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1. △ Less

Submitted 27 August, 2025; originally announced August 2025.

arXiv:2508.20133 [pdf, ps, other]

Proactive HIV Care: AI-Based Comorbidity Prediction from Routine EHR Data

Authors: Solomon Russom, Dimitrios Kollias, Qianni Zhang

Abstract: People living with HIV face a high burden of comorbidities, yet early detection is often limited by symptom-driven screening. We evaluate the potential of AI to predict multiple comorbidities from routinely collected Electronic Health Records. Using data from 2,200 HIV-positive patients in South East London, comprising 30 laboratory markers and 7 demographic/social attributes, we compare demograph… ▽ More People living with HIV face a high burden of comorbidities, yet early detection is often limited by symptom-driven screening. We evaluate the potential of AI to predict multiple comorbidities from routinely collected Electronic Health Records. Using data from 2,200 HIV-positive patients in South East London, comprising 30 laboratory markers and 7 demographic/social attributes, we compare demographic-aware models (which use both laboratory/social variables and demographic information as input) against demographic-unaware models (which exclude all demographic information). Across all methods, demographic-aware models consistently outperformed unaware counterparts. Demographic recoverability experiments revealed that gender and age can be accurately inferred from laboratory data, underscoring both the predictive value and fairness considerations of demographic features. These findings show that combining demographic and laboratory data can improve automated, multi-label comorbidity prediction in HIV care, while raising important questions about bias and interpretability in clinical AI. △ Less

Submitted 29 August, 2025; v1 submitted 26 August, 2025; originally announced August 2025.

Comments: accepted at ICCV 2025

arXiv:2508.19957 [pdf, ps, other]

Multi-field decomposed hyper-reduced order modeling of damage-plasticity simulations

Authors: Jannick Kehls, Stephan Ritzert, Lars Breuer, Qinghua Zhang, Stefanie Reese, Tim Brepols

Abstract: This paper presents a multi-field decomposed approach for hyper-reduced order modeling to overcome the limitations of traditional model reduction techniques for gradient-extended damage-plasticity simulations. The discrete empirical interpolation method (DEIM) and the energy-conserving sampling and weighting method (ECSW) are extended to account for the multi-field nature of the problem. Both meth… ▽ More This paper presents a multi-field decomposed approach for hyper-reduced order modeling to overcome the limitations of traditional model reduction techniques for gradient-extended damage-plasticity simulations. The discrete empirical interpolation method (DEIM) and the energy-conserving sampling and weighting method (ECSW) are extended to account for the multi-field nature of the problem. Both methods yield stable reduced order simulations, while significantly reducing the computational cost compared to full-order simulations. Two numerical examples are presented to demonstrate the performance and limitations of the proposed approaches. The decomposed ECSW method has overall higher accuracy and lower computational cost than the decomposed DEIM method. △ Less

Submitted 27 August, 2025; originally announced August 2025.

arXiv:2508.19855 [pdf, ps, other]

Youtu-GraphRAG: Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning

Authors: Junnan Dong, Siyu An, Yifei Yu, Qian-Wen Zhang, Linhao Luo, Xiao Huang, Yunsheng Wu, Di Yin, Xing Sun

Abstract: Graph retrieval-augmented generation (GraphRAG) has effectively enhanced large language models in complex reasoning by organizing fragmented knowledge into explicitly structured graphs. Prior efforts have been made to improve either graph construction or graph retrieval in isolation, yielding suboptimal performance, especially when domain shifts occur. In this paper, we propose a vertically unifie… ▽ More Graph retrieval-augmented generation (GraphRAG) has effectively enhanced large language models in complex reasoning by organizing fragmented knowledge into explicitly structured graphs. Prior efforts have been made to improve either graph construction or graph retrieval in isolation, yielding suboptimal performance, especially when domain shifts occur. In this paper, we propose a vertically unified agentic paradigm, Youtu-GraphRAG, to jointly connect the entire framework as an intricate integration. Specifically, (i) a seed graph schema is introduced to bound the automatic extraction agent with targeted entity types, relations and attribute types, also continuously expanded for scalability over unseen domains; (ii) To obtain higher-level knowledge upon the schema, we develop novel dually-perceived community detection, fusing structural topology with subgraph semantics for comprehensive knowledge organization. This naturally yields a hierarchical knowledge tree that supports both top-down filtering and bottom-up reasoning with community summaries; (iii) An agentic retriever is designed to interpret the same graph schema to transform complex queries into tractable and parallel sub-queries. It iteratively performs reflection for more advanced reasoning; (iv) To alleviate the knowledge leaking problem in pre-trained LLM, we propose a tailored anonymous dataset and a novel 'Anonymity Reversion' task that deeply measures the real performance of the GraphRAG frameworks. Extensive experiments across six challenging benchmarks demonstrate the robustness of Youtu-GraphRAG, remarkably moving the Pareto frontier with up to 90.71% saving of token costs and 16.62% higher accuracy over state-of-the-art baselines. The results indicate our adaptability, allowing seamless domain transfer with minimal intervention on schema. △ Less

Submitted 2 September, 2025; v1 submitted 27 August, 2025; originally announced August 2025.

Comments: 19 pages, 7 figures, 6 tables

arXiv:2508.18616 [pdf, ps, other]

Optimal $(α,β)$-Dense Subgraph Search in Bipartite Graphs

Authors: Yalong Zhang, Rong-Hua Li, Qi Zhang, Guoren Wang

Abstract: Dense subgraph search in bipartite graphs is a fundamental problem in graph analysis, with wide-ranging applications in fraud detection, recommendation systems, and social network analysis. The recently proposed $(α, β)$-dense subgraph model has demonstrated superior capability in capturing the intrinsic density structure of bipartite graphs compared to existing alternatives. However, despite its… ▽ More Dense subgraph search in bipartite graphs is a fundamental problem in graph analysis, with wide-ranging applications in fraud detection, recommendation systems, and social network analysis. The recently proposed $(α, β)$-dense subgraph model has demonstrated superior capability in capturing the intrinsic density structure of bipartite graphs compared to existing alternatives. However, despite its modeling advantages, the $(α, β)$-dense subgraph model lacks efficient support for query processing and dynamic updates, limiting its practical utility in large-scale applications. To address these limitations, we propose BD-Index, a novel index that answers $(α, β)$-dense subgraph queries in optimal time while using only linear space $O(|E|)$, making it well-suited for real-world applications requiring both fast query processing and low memory consumption. We further develop two complementary maintenance strategies for dynamic bipartite graphs to support efficient updates to the BD-Index. The space-efficient strategy updates the index in time complexity of $O(p \cdot |E|^{1.5})$ per edge insertion or deletion, while maintaining a low space cost of $O(|E|)$ (the same as the index itself), where $p$ is typically a small constant in real-world graphs. In contrast, the time-efficient strategy significantly reduces the update time to $O(p \cdot |E|)$ per edge update by maintaining auxiliary orientation structures, at the cost of increased memory usage up to $O(p \cdot |E|)$. These two strategies provide flexible trade-offs between maintenance efficiency and memory usage, enabling BD-Index to adapt to diverse application requirements. Extensive experiments on 10 large-scale real-world datasets demonstrate high efficiency and scalability of our proposed solutions. △ Less

Submitted 25 August, 2025; originally announced August 2025.

arXiv:2508.18582 [pdf, ps, other]

doi 10.1109/TWC.2025.3599514

Multi-Resolution Codebook Design and Multiuser Interference Management for Discrete XL-RIS-Aided Near-Field MIMO Systems

Authors: Qian Zhang, Zheng Dong, Zheng Dong, Yao Ge, Yong Liang Guan, Ju Liu, Chau Yuen

Abstract: Extremely large-scale reconfigurable intelligent surface (XL-RIS) can effectively overcome severe fading and provide higher communication performance. However, current research on XL-RIS overlooks the discrete phase-shift characteristics of RIS in practical systems, which will result in significant performance degradation.In this paper, we investigate near-field communication schemes assisted by X… ▽ More Extremely large-scale reconfigurable intelligent surface (XL-RIS) can effectively overcome severe fading and provide higher communication performance. However, current research on XL-RIS overlooks the discrete phase-shift characteristics of RIS in practical systems, which will result in significant performance degradation.In this paper, we investigate near-field communication schemes assisted by XL-RIS with discrete phase shifts.Specifically, we propose a hierarchical beam training method to obtain the user channel state information (CSI), and develop the jointly optimized codebook construction (JOCC) method and separately optimized codebook construction (SOCC) method for base station (BS) precoding and XL-RIS phase shifts, respectively. With JOCC, the most superior beam training performance can be obtained.With SOCC, higher performance than the single-antenna BS codebook can be obtained at a similar complexity.Further, we propose a flexible multiuser interference management (IM) method that is simple to solve. The IM method uses adaptive gain matrix approximation to take into account user fairness and can be solved in closed-form iterations. In addition, we extend the proposed method to a hybrid precoding design. Simulation results demonstrate that the proposed multi-resolution codebook construction method can obtain more accurate beam patterns and user CSI, and the proposed IM method obtains superior performance over the benchmark methods. △ Less

Submitted 25 August, 2025; originally announced August 2025.

Journal ref: IEEE Transactions on Wireless Communications, 2025

arXiv:2508.18506 [pdf, ps, other]

DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance

Authors: Ajinkya Khoche, Qingwen Zhang, Yixi Cai, Sina Sharif Mansouri, Patric Jensfelt

Abstract: Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse w… ▽ More Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion pseudo-labels in real-time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, please visit https://ajinkyakhoche.github.io/DogFlow/ △ Less

Submitted 25 August, 2025; originally announced August 2025.

Comments: Under Review

arXiv:2508.18486 [pdf, ps, other]

Huracan: A skillful end-to-end data-driven system for ensemble data assimilation and weather prediction

Authors: Zekun Ni, Jonathan Weyn, Hang Zhang, Yanfei Xiang, Jiang Bian, Weixin Jin, Kit Thambiratnam, Qi Zhang, Haiyu Dong, Hongyu Sun

Abstract: Over the past few years, machine learning-based data-driven weather prediction has been transforming operational weather forecasting by providing more accurate forecasts while using a mere fraction of computing power compared to traditional numerical weather prediction (NWP). However, those models still rely on initial conditions from NWP, putting an upper limit on their forecast abilities. A few… ▽ More Over the past few years, machine learning-based data-driven weather prediction has been transforming operational weather forecasting by providing more accurate forecasts while using a mere fraction of computing power compared to traditional numerical weather prediction (NWP). However, those models still rely on initial conditions from NWP, putting an upper limit on their forecast abilities. A few end-to-end systems have since been proposed, but they have yet to match the forecast skill of state-of-the-art NWP competitors. In this work, we propose Huracan, an observation-driven weather forecasting system which combines an ensemble data assimilation model with a forecast model to produce highly accurate forecasts relying only on observations as inputs. Huracan is not only the first to provide ensemble initial conditions and end-to-end ensemble weather forecasts, but also the first end-to-end system to achieve an accuracy comparable with that of ECMWF ENS, the state-of-the-art NWP competitor, despite using a smaller amount of available observation data. Notably, Huracan matches or exceeds the continuous ranked probability score of ECMWF ENS on 75.4% of the variable and lead time combinations. Our work is a major step forward in end-to-end data-driven weather prediction and opens up opportunities for further improving and revolutionizing operational weather forecasting. △ Less

Submitted 25 August, 2025; originally announced August 2025.

arXiv:2508.18040 [pdf, ps, other]

PerPilot: Personalizing VLM-based Mobile Agents via Memory and Exploration

Authors: Xin Wang, Zhiyao Cui, Hao Li, Ya Zeng, Chenxu Wang, Ruiqi Song, Yihang Chen, Kun Shao, Qiaosheng Zhang, Jinzhuo Liu, Siyue Ren, Shuyue Hu, Zhen Wang

Abstract: Vision language model (VLM)-based mobile agents show great potential for assisting users in performing instruction-driven tasks. However, these agents typically struggle with personalized instructions -- those containing ambiguous, user-specific context -- a challenge that has been largely overlooked in previous research. In this paper, we define personalized instructions and introduce PerInstruct… ▽ More Vision language model (VLM)-based mobile agents show great potential for assisting users in performing instruction-driven tasks. However, these agents typically struggle with personalized instructions -- those containing ambiguous, user-specific context -- a challenge that has been largely overlooked in previous research. In this paper, we define personalized instructions and introduce PerInstruct, a novel human-annotated dataset covering diverse personalized instructions across various mobile scenarios. Furthermore, given the limited personalization capabilities of existing mobile agents, we propose PerPilot, a plug-and-play framework powered by large language models (LLMs) that enables mobile agents to autonomously perceive, understand, and execute personalized user instructions. PerPilot identifies personalized elements and autonomously completes instructions via two complementary approaches: memory-based retrieval and reasoning-based exploration. Experimental results demonstrate that PerPilot effectively handles personalized tasks with minimal user intervention and progressively improves its performance with continued use, underscoring the importance of personalization-aware reasoning for next-generation mobile agents. The dataset and code are available at: https://github.com/xinwang-nwpu/PerPilot △ Less

Submitted 25 August, 2025; originally announced August 2025.

arXiv:2508.17972 [pdf, ps, other]

SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization

Authors: Junyuan Deng, Heng Li, Tao Xie, Weiqiang Ren, Qian Zhang, Ping Tan, Xiaoyang Guo

Abstract: Scene regression methods, such as VGGT, solve the Structure-from-Motion (SfM) problem by directly regressing camera poses and 3D scene structures from input images. They demonstrate impressive performance in handling images under extreme viewpoint changes. However, these methods struggle to handle a large number of input images. To address this problem, we introduce SAIL-Recon, a feed-forward Tran… ▽ More Scene regression methods, such as VGGT, solve the Structure-from-Motion (SfM) problem by directly regressing camera poses and 3D scene structures from input images. They demonstrate impressive performance in handling images under extreme viewpoint changes. However, these methods struggle to handle a large number of input images. To address this problem, we introduce SAIL-Recon, a feed-forward Transformer for large scale SfM, by augmenting the scene regression network with visual localization capabilities. Specifically, our method first computes a neural scene representation from a subset of anchor images. The regression network is then fine-tuned to reconstruct all input images conditioned on this neural scene representation. Comprehensive experiments show that our method not only scales efficiently to large-scale scenes, but also achieves state-of-the-art results on both camera pose estimation and novel view synthesis benchmarks, including TUM-RGBD, CO3Dv2, and Tanks & Temples. We will publish our model and code. Code and models are publicly available at: https://hkust-sail.github.io/ sail-recon/. △ Less

Submitted 25 August, 2025; originally announced August 2025.

arXiv:2508.17771 [pdf, ps, other]

Speculating LLMs' Chinese Training Data Pollution from Their Tokens

Authors: Qingjie Zhang, Di Wang, Haoting Qian, Liu Yan, Tianwei Zhang, Ke Xu, Qi Li, Minlie Huang, Hewu Li, Han Qiu

Abstract: Tokens are basic elements in the datasets for LLM training. It is well-known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) are indicating contents like pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens' existence and training… ▽ More Tokens are basic elements in the datasets for LLM training. It is well-known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) are indicating contents like pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens' existence and training data. (1) We give a formal definition and taxonomy of PoC tokens based on the GPT's vocabulary. (2) We build a PoC token detector via fine-tuning an LLM to label PoC tokens in vocabularies by considering each token's both semantics and related contents from the search engines. (3) We study the speculation on the training data pollution via PoC tokens' appearances (token ID). Experiments on GPT and other 23 LLMs indicate that tokens widely exist while GPT's vocabulary behaves the worst: more than 23% long Chinese tokens (i.e., a token with more than two Chinese characters) are either porn or online gambling. We validate the accuracy of our speculation method on famous pre-training datasets like C4 and Pile. Then, considering GPT-4o, we speculate that the ratio of "Yui Hatano" related webpages in GPT-4o's training data is around 0.5%. △ Less

Submitted 25 August, 2025; originally announced August 2025.

arXiv:2508.17556 [pdf, ps, other]

SEFRQO: A Self-Evolving Fine-Tuned RAG-Based Query Optimizer

Authors: Hanwen Liu, Qihan Zhang, Ryan Marcus, Ibrahim Sabek

Abstract: Query optimization is a crucial problem in database systems that has been studied for decades. Learned query optimizers (LQOs) can improve performance over time by incorporating feedback; however, they suffer from cold-start issues and often require retraining when workloads shift or schemas change. Recent LLM-based query optimizers leverage pre-trained and fine-tuned LLMs to mitigate these challe… ▽ More Query optimization is a crucial problem in database systems that has been studied for decades. Learned query optimizers (LQOs) can improve performance over time by incorporating feedback; however, they suffer from cold-start issues and often require retraining when workloads shift or schemas change. Recent LLM-based query optimizers leverage pre-trained and fine-tuned LLMs to mitigate these challenges. Nevertheless, they neglect LLMs' in-context learning and execution records as feedback for continuous evolution. In this paper, we present SEFRQO, a Self-Evolving Fine-tuned RAG-based Query Optimizer. SEFRQO mitigates the cold-start problem of LQOs by continuously learning from execution feedback via a Retrieval-Augmented Generation (RAG) framework. We employ both supervised fine-tuning and reinforcement fine-tuning to prepare the LLM to produce syntactically correct and performance-efficient query hints. Moreover, SEFRQO leverages the LLM's in-context learning capabilities by dynamically constructing prompts with references to similar queries and the historical execution record of the same query. This self-evolving paradigm iteratively optimizes the prompt to minimize query execution latency. Evaluations show that SEFRQO outperforms state-of-the-art LQOs, achieving up to 65.05% and 93.57% reductions in query latency on the CEB and Stack workloads, respectively, compared to PostgreSQL. △ Less

Submitted 24 August, 2025; originally announced August 2025.

Comments: To appear at SIGMOD 2026 (https://2026.sigmod.org/)

arXiv:2508.17436 [pdf, ps, other]

Disentangled Geometry and Appearance for Efficient Multi-View Surface Reconstruction and Rendering

Authors: Qitong Zhang, Jieqing Feng

Abstract: This paper addresses the limitations of neural rendering-based multi-view surface reconstruction methods, which require an additional mesh extraction step that is inconvenient and would produce poor-quality surfaces with mesh aliasing, restricting downstream applications. Building on the explicit mesh representation and differentiable rasterization framework, this work proposes an efficient soluti… ▽ More This paper addresses the limitations of neural rendering-based multi-view surface reconstruction methods, which require an additional mesh extraction step that is inconvenient and would produce poor-quality surfaces with mesh aliasing, restricting downstream applications. Building on the explicit mesh representation and differentiable rasterization framework, this work proposes an efficient solution that preserves the high efficiency of this framework while significantly improving reconstruction quality and versatility. Specifically, we introduce a disentangled geometry and appearance model that does not rely on deep networks, enhancing learning and broadening applicability. A neural deformation field is constructed to incorporate global geometric context, enhancing geometry learning, while a novel regularization constrains geometric features passed to a neural shader to ensure its accuracy and boost shading. For appearance, a view-invariant diffuse term is separated and baked into mesh vertices, further improving rendering efficiency. Experimental results demonstrate that the proposed method achieves state-of-the-art training (4.84 minutes) and rendering (0.023 seconds) speeds, with reconstruction quality that is competitive with top-performing methods. Moreover, the method enables practical applications such as mesh and texture editing, showcasing its versatility and application potential. This combination of efficiency, competitive quality, and broad applicability makes our approach a valuable contribution to multi-view surface reconstruction and rendering. △ Less

Submitted 24 August, 2025; originally announced August 2025.

arXiv:2508.17434 [pdf, ps, other]

TinySR: Pruning Diffusion for Real-World Image Super-Resolution

Authors: Linwei Dong, Qingnan Fan, Yuhang Yu, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou

Abstract: Real-world image super-resolution (Real-ISR) focuses on recovering high-quality images from low-resolution inputs that suffer from complex degradations like noise, blur, and compression. Recently, diffusion models (DMs) have shown great potential in this area by leveraging strong generative priors to restore fine details. However, their iterative denoising process incurs high computational overhea… ▽ More Real-world image super-resolution (Real-ISR) focuses on recovering high-quality images from low-resolution inputs that suffer from complex degradations like noise, blur, and compression. Recently, diffusion models (DMs) have shown great potential in this area by leveraging strong generative priors to restore fine details. However, their iterative denoising process incurs high computational overhead, posing challenges for real-time applications. Although one-step distillation methods, such as OSEDiff and TSD-SR, offer faster inference, they remain fundamentally constrained by their large, over-parameterized model architectures. In this work, we present TinySR, a compact yet effective diffusion model specifically designed for Real-ISR that achieves real-time performance while maintaining perceptual quality. We introduce a Dynamic Inter-block Activation and an Expansion-Corrosion Strategy to facilitate more effective decision-making in depth pruning. We achieve VAE compression through channel pruning, attention removal and lightweight SepConv. We eliminate time- and prompt-related modules and perform pre-caching techniques to further speed up the model. TinySR significantly reduces computational cost and model size, achieving up to 5.68x speedup and 83% parameter reduction compared to its teacher TSD-SR, while still providing high quality results. △ Less

Submitted 24 August, 2025; originally announced August 2025.

arXiv:2508.17213 [pdf, ps, other]

Multi-modal Knowledge Decomposition based Online Distillation for Biomarker Prediction in Breast Cancer Histopathology

Authors: Qibin Zhang, Xinyu Hao, Qiao Chen, Rui Xu, Fengyu Cong, Cheng Lu, Hongming Xu

Abstract: Immunohistochemical (IHC) biomarker prediction benefits from multi-modal data fusion analysis. However, the simultaneous acquisition of multi-modal data, such as genomic and pathological information, is often challenging due to cost or technical limitations. To address this challenge, we propose an online distillation approach based on Multi-modal Knowledge Decomposition (MKD) to enhance IHC bioma… ▽ More Immunohistochemical (IHC) biomarker prediction benefits from multi-modal data fusion analysis. However, the simultaneous acquisition of multi-modal data, such as genomic and pathological information, is often challenging due to cost or technical limitations. To address this challenge, we propose an online distillation approach based on Multi-modal Knowledge Decomposition (MKD) to enhance IHC biomarker prediction in haematoxylin and eosin (H\&E) stained histopathology images. This method leverages paired genomic-pathology data during training while enabling inference using either pathology slides alone or both modalities. Two teacher and one student models are developed to extract modality-specific and modality-general features by minimizing the MKD loss. To maintain the internal structural relationships between samples, Similarity-preserving Knowledge Distillation (SKD) is applied. Additionally, Collaborative Learning for Online Distillation (CLOD) facilitates mutual learning between teacher and student models, encouraging diverse and complementary learning dynamics. Experiments on the TCGA-BRCA and in-house QHSU datasets demonstrate that our approach achieves superior performance in IHC biomarker prediction using uni-modal data. Our code is available at https://github.com/qiyuanzz/MICCAI2025_MKD. △ Less

Submitted 24 August, 2025; originally announced August 2025.

Comments: Accepted at MICCAI 2025

arXiv:2508.17054 [pdf, ps, other]

DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method

Authors: Qingwen Zhang, Xiaomeng Zhu, Yushan Zhang, Yixi Cai, Olov Andersson, Patric Jensfelt

Abstract: Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow ($Δ$Flow), a lightweight… ▽ More Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow ($Δ$Flow), a lightweight 3D framework that captures motion cues via a $Δ$ scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2 and Waymo datasets show that $Δ$Flow achieves state-of-the-art performance with up to 22% lower error and $2\times$ faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability. The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights. △ Less

Submitted 23 August, 2025; originally announced August 2025.

Comments: 17 pages (9 main pages + 8 supp materail), 11 figures, code at https://github.com/Kin-Zhang/DeltaFlow

arXiv:2508.16943 [pdf, ps, other]

HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement

Authors: Haozhuo Zhang, Jingkai Sun, Michele Caprio, Jian Tang, Shanghang Zhang, Qiang Zhang, Wei Pan

Abstract: We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control that enables a single physically simulated robot to perform long-horizon, multi-object rearrangement tasks across diverse scenes. Unlike prior methods that operate in fixed settings with single-object interactions, our approach supports consecutive manipulation of multiple objects, guided only by natural lang… ▽ More We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control that enables a single physically simulated robot to perform long-horizon, multi-object rearrangement tasks across diverse scenes. Unlike prior methods that operate in fixed settings with single-object interactions, our approach supports consecutive manipulation of multiple objects, guided only by natural language instructions and egocentric camera RGB observations. HumanoidVerse is trained via a multi-stage curriculum using a dual-teacher distillation pipeline, enabling fluid transitions between sub-tasks without requiring environment resets. To support this, we construct a large-scale dataset comprising 350 multi-object tasks spanning four room layouts. Extensive experiments in the Isaac Gym simulator demonstrate that our method significantly outperforms prior state-of-the-art in both task success rate and spatial precision, and generalizes well to unseen environments and instructions. Our work represents a key step toward robust, general-purpose humanoid agents capable of executing complex, sequential tasks under real-world sensory constraints. The video visualization results can be found on the project page: https://haozhuo-zhang.github.io/HumanoidVerse-project-page/. △ Less

Submitted 23 August, 2025; originally announced August 2025.

Comments: Project Page: https://haozhuo-zhang.github.io/HumanoidVerse-project-page/

arXiv:2508.16917 [pdf, ps, other]

Structural Energy-Guided Sampling for View-Consistent Text-to-3D

Authors: Qing Zhang, Jinguang Tong, Jie Hong, Jing Zhang, Xuesong Li

Abstract: Text-to-3D generation often suffers from the Janus problem, where objects look correct from the front but collapse into duplicated or distorted geometry from other angles. We attribute this failure to viewpoint bias in 2D diffusion priors, which propagates into 3D optimization. To address this, we propose Structural Energy-Guided Sampling (SEGS), a training-free, plug-and-play framework that enfor… ▽ More Text-to-3D generation often suffers from the Janus problem, where objects look correct from the front but collapse into duplicated or distorted geometry from other angles. We attribute this failure to viewpoint bias in 2D diffusion priors, which propagates into 3D optimization. To address this, we propose Structural Energy-Guided Sampling (SEGS), a training-free, plug-and-play framework that enforces multi-view consistency entirely at sampling time. SEGS defines a structural energy in a PCA subspace of intermediate U-Net features and injects its gradients into the denoising trajectory, steering geometry toward the intended viewpoint while preserving appearance fidelity. Integrated seamlessly into SDS/VSD pipelines, SEGS significantly reduces Janus artifacts, achieving improved geometric alignment and viewpoint consistency without retraining or weight modification. △ Less

Submitted 23 August, 2025; originally announced August 2025.

arXiv:2508.16065 [pdf, ps, other]

doi 10.1007/s11704-025-50136-2

Ethical Considerations of Large Language Models in Game Playing

Authors: Qingquan Zhang, Yuchen Li, Bo Yuan, Julian Togelius, Georgios N. Yannakakis, Jialin Liu

Abstract: Large language models (LLMs) have demonstrated tremendous potential in game playing, while little attention has been paid to their ethical implications in those contexts. This work investigates and analyses the ethical considerations of applying LLMs in game playing, using Werewolf, also known as Mafia, as a case study. Gender bias, which affects game fairness and player experience, has been obser… ▽ More Large language models (LLMs) have demonstrated tremendous potential in game playing, while little attention has been paid to their ethical implications in those contexts. This work investigates and analyses the ethical considerations of applying LLMs in game playing, using Werewolf, also known as Mafia, as a case study. Gender bias, which affects game fairness and player experience, has been observed from the behaviour of LLMs. Some roles, such as the Guard and Werewolf, are more sensitive than others to gender information, presented as a higher degree of behavioural change. We further examine scenarios in which gender information is implicitly conveyed through names, revealing that LLMs still exhibit discriminatory tendencies even in the absence of explicit gender labels. This research showcases the importance of developing fair and ethical LLMs. Beyond our research findings, we discuss the challenges and opportunities that lie ahead in this field, emphasising the need for diving deeper into the ethical implications of LLMs in gaming and other interactive domains. △ Less

Submitted 21 August, 2025; originally announced August 2025.

Comments: 19 pages

Journal ref: Frontiers of Computer Science (2025)

arXiv:2508.15919 [pdf, ps, other]

HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

Authors: Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan

Abstract: Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). Meeting these requires real-time scheduling, rapid and cost-effective scaling, and support for both collocated and disaggregated Prefill/Decode (P/D) architectures. We present \textbf{HyperFlexis}, a unified LLM serv… ▽ More Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). Meeting these requires real-time scheduling, rapid and cost-effective scaling, and support for both collocated and disaggregated Prefill/Decode (P/D) architectures. We present \textbf{HyperFlexis}, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs. It features a multi-SLO-aware scheduler that leverages budget estimation and request prioritization to ensure proactive SLO compliance for both new and ongoing requests. The system supports prefill- and decode-stage multi-SLO scheduling for P/D-disaggregated architectures and KV cache transfers. It also enables cost-effective scaling decisions, prefill-decode instance linking during scaling, and rapid P/D role transitions. To accelerate scaling and reduce cold-start latency, a device-to-device (D2D) weight transfer mechanism is proposed that lowers weight loading overhead by up to \textbf{19.39$\times$}. These optimizations allow the system to achieve up to \textbf{4.44$\times$} higher SLO attainment, \textbf{65.82\%} lower request latency, and cost parity with state-of-the-art baselines. The code will be released soon. △ Less

Submitted 21 August, 2025; originally announced August 2025.

arXiv:2508.15763 [pdf, ps, other]

Intern-S1: A Scientific Multimodal Foundation Model

Authors: Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqing Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan , et al. (152 additional authors not shown)

Abstract: In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared… ▽ More In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1. △ Less

Submitted 24 August, 2025; v1 submitted 21 August, 2025; originally announced August 2025.

arXiv:2508.15305 [pdf, ps, other]

Coarse-to-Fine Grounded Memory for LLM Agent Planning

Authors: Wei Yang, Jinwei Xiao, Hongming Zhang, Qingyang Zhang, Yanna Wang, Bo Xu

Abstract: Recent advancements in Large Language Models (LLMs) have driven growing interest in LLM-based agents for complex planning tasks. To avoid costly agent training, many studies adopted memory mechanism that enhances LLM with offline experiences or online trajectory analysis. However, existing works focus on single-granularity memory derived from dynamic environmental interactions, which are inherentl… ▽ More Recent advancements in Large Language Models (LLMs) have driven growing interest in LLM-based agents for complex planning tasks. To avoid costly agent training, many studies adopted memory mechanism that enhances LLM with offline experiences or online trajectory analysis. However, existing works focus on single-granularity memory derived from dynamic environmental interactions, which are inherently constrained by the quality of the collected experiences. This limitation, in turn, constrain the diversity of knowledge and the flexibility of planning. We propose Coarse-to-Fine Grounded Memory (\Ours{}), a novel framework that grounds coarse-to-fine memories with LLM, thereby fully leverage them for flexible adaptation to diverse scenarios. \Ours{} grounds environmental information into coarse-grained focus points to guide experience collection in training tasks, followed by grounding of actionable hybrid-grained tips from each experience. At inference, \Ours{} retrieves task-relevant experiences and tips to support planning. When facing environmental anomalies, the LLM grounds the current situation into fine-grained key information, enabling flexible self-QA reflection and plan correction. △ Less

Submitted 21 August, 2025; originally announced August 2025.

Comments: Accepted to EMNLP 2025 Main Conference;27 pages,15 figures

arXiv:2508.15214 [pdf, ps, other]

Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall

Authors: Sijia Cui, Aiyao He, Shuai Xu, Hongming Zhang, Yanna Wang, Qingyang Zhang, Yajing Wang, Bo Xu

Abstract: Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations, or retrieving from a curated library. These approaches demand substanti… ▽ More Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations, or retrieving from a curated library. These approaches demand substantial expert effort and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1\% on easy and 4.7\% on hard questions. We further test SEER on $τ$-bench, which includes two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44\% and 23.38\%, respectively. △ Less

Submitted 20 August, 2025; originally announced August 2025.

Comments: Accepted to EMNLP 2025

arXiv:2508.14896 [pdf, ps, other]

Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs

Authors: Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun

Abstract: Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantiz… ▽ More Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community. △ Less

Submitted 20 August, 2025; originally announced August 2025.

Comments: Technical Report, Work in Progress

arXiv:2508.14848 [pdf, ps, other]

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

Authors: Qiao Zhang, Rabab Alomairy, Dali Wang, Zhuowei Gu, Qinglei Cao

Abstract: General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic necessitates a reevaluation of numerical algorithms to leverage mixed-precision computations, achieving improved performance and energy efficiency. This research… ▽ More General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic necessitates a reevaluation of numerical algorithms to leverage mixed-precision computations, achieving improved performance and energy efficiency. This research introduces an adaptive mixed-precision GEMM framework that supports different precision formats at fine-grained tile/block levels. We utilize the PaRSEC runtime system to balance workloads across various architectures. The performance scales well on ARM CPU-based Fugaku supercomputer, Nvidia GPU-based A100 DGX, and AMD GPU-based Frontier supercomputer. This research aims to enhance computational efficiency and accuracy by bridging algorithmic advancements and hardware innovations, driving transformative progress in various applications. △ Less

Submitted 20 August, 2025; originally announced August 2025.

arXiv:2508.14415 [pdf, ps, other]

The Agent Behavior: Model, Governance and Challenges in the AI Digital Age

Authors: Qiang Zhang, Pei Yan, Yijia Xu, Chuanpo Fu, Yong Fang, Yang Liu

Abstract: Advancements in AI have led to agents in networked environments increasingly mirroring human behavior, thereby blurring the boundary between artificial and human actors in specific contexts. This shift brings about significant challenges in trust, responsibility, ethics, security and etc. The difficulty in supervising of agent behaviors may lead to issues such as data contamination and unclear acc… ▽ More Advancements in AI have led to agents in networked environments increasingly mirroring human behavior, thereby blurring the boundary between artificial and human actors in specific contexts. This shift brings about significant challenges in trust, responsibility, ethics, security and etc. The difficulty in supervising of agent behaviors may lead to issues such as data contamination and unclear accountability. To address these challenges, this paper proposes the "Network Behavior Lifecycle" model, which divides network behavior into 6 stages and systematically analyzes the behavioral differences between humans and agents at each stage. Based on these insights, the paper further introduces the "Agent for Agent (A4A)" paradigm and the "Human-Agent Behavioral Disparity (HABD)" model, which examine the fundamental distinctions between human and agent behaviors across 5 dimensions: decision mechanism, execution efficiency, intention-behavior consistency, behavioral inertia, and irrational patterns. The effectiveness of the model is verified through real-world cases such as red team penetration and blue team defense. Finally, the paper discusses future research directions in dynamic cognitive governance architecture, behavioral disparity quantification, and meta-governance protocol stacks, aiming to provide a theoretical foundation and technical roadmap for secure and trustworthy human-agent collaboration. △ Less

Submitted 20 August, 2025; originally announced August 2025.

arXiv:2508.14313 [pdf, ps, other]

Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS

Authors: Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas

Abstract: Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generat… ▽ More Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9 % on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs. △ Less

Submitted 22 August, 2025; v1 submitted 19 August, 2025; originally announced August 2025.

arXiv:2508.14111 [pdf, ps, other]

From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery

Authors: Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, Zijie Qiu, Xuming He, Qiang Zhang, Chenyu You, Shuangjia Zheng, Ning Ding, Wanli Ouyang, Nanqing Dong, Yu Cheng, Siqi Sun, Lei Bai, Bowen Zhou

Abstract: Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position Agentic Science as a pivotal stage within the broader AI for Science paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research pl… ▽ More Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position Agentic Science as a pivotal stage within the broader AI for Science paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research platforms, agentic AI shows capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement -- behaviors once regarded as uniquely human. This survey provides a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials science, and physics. We unify three previously fragmented perspectives -- process-oriented, autonomy-oriented, and mechanism-oriented -- through a comprehensive framework that connects foundational capabilities, core processes, and domain-specific realizations. Building on this framework, we (i) trace the evolution of AI for Science, (ii) identify five core capabilities underpinning scientific agency, (iii) model discovery as a dynamic four-stage workflow, (iv) review applications across the above domains, and (v) synthesize key challenges and future opportunities. This work establishes a domain-oriented synthesis of autonomous scientific discovery and positions Agentic Science as a structured paradigm for advancing AI-driven research. △ Less

Submitted 18 August, 2025; originally announced August 2025.

arXiv:2508.14074 [pdf, ps, other]

GEPD:GAN-Enhanced Generalizable Model for EEG-Based Detection of Parkinson's Disease

Authors: Qian Zhang, Ruilin Zhang, Biaokai Zhu, Xun Han, Jun Xiao, Yifan Liu, Zhe Wang

Abstract: Electroencephalography has been established as an effective method for detecting Parkinson's disease, typically diagnosed early.Current Parkinson's disease detection methods have shown significant success within individual datasets, however, the variability in detection methods across different EEG datasets and the small size of each dataset pose challenges for training a generalizable model for c… ▽ More Electroencephalography has been established as an effective method for detecting Parkinson's disease, typically diagnosed early.Current Parkinson's disease detection methods have shown significant success within individual datasets, however, the variability in detection methods across different EEG datasets and the small size of each dataset pose challenges for training a generalizable model for cross-dataset scenarios. To address these issues, this paper proposes a GAN-enhanced generalizable model, named GEPD, specifically for EEG-based cross-dataset classification of Parkinson's disease.First, we design a generative network that creates fusion EEG data by controlling the distribution similarity between generated data and real data.In addition, an EEG signal quality assessment model is designed to ensure the quality of generated data great.Second, we design a classification network that utilizes a combination of multiple convolutional neural networks to effectively capture the time-frequency characteristics of EEG signals, while maintaining a generalizable structure and ensuring easy convergence.This work is dedicated to utilizing intelligent methods to study pathological manifestations, aiming to facilitate the diagnosis and monitoring of neurological diseases.The evaluation results demonstrate that our model performs comparably to state-of-the-art models in cross-dataset settings, achieving an accuracy of 84.3% and an F1-score of 84.0%, showcasing the generalizability of the proposed model. △ Less

Submitted 12 August, 2025; originally announced August 2025.

Comments: Accepted by International Conference on Intelligent Computing(ICIC 2025)

arXiv:2508.14073 [pdf, ps, other]

MCLPD:Multi-view Contrastive Learning for EEG-based PD Detection Across Datasets

Authors: Qian Zhang, Ruilin Zhang, Jun Xiao, Yifan Liu, Zhe Wang

Abstract: Electroencephalography has been validated as an effective technique for detecting Parkinson's disease,particularly in its early stages.However,the high cost of EEG data annotation often results in limited dataset size and considerable discrepancies across datasets,including differences in acquisition protocols and subject demographics,significantly hinder the robustness and generalizability of mod… ▽ More Electroencephalography has been validated as an effective technique for detecting Parkinson's disease,particularly in its early stages.However,the high cost of EEG data annotation often results in limited dataset size and considerable discrepancies across datasets,including differences in acquisition protocols and subject demographics,significantly hinder the robustness and generalizability of models in cross-dataset detection scenarios.To address such challenges,this paper proposes a semi-supervised learning framework named MCLPD,which integrates multi-view contrastive pre-training with lightweight supervised fine-tuning to enhance cross-dataset PD detection performance.During pre-training,MCLPD uses self-supervised learning on the unlabeled UNM dataset.To build contrastive pairs,it applies dual augmentations in both time and frequency domains,which enrich the data and naturally fuse time-frequency information.In the fine-tuning phase,only a small proportion of labeled data from another two datasets (UI and UC)is used for supervised optimization.Experimental results show that MCLPD achieves F1 scores of 0.91 on UI and 0.81 on UC using only 1%of labeled data,which further improve to 0.97 and 0.87,respectively,when 5%of labeled data is used.Compared to existing methods,MCLPD substantially improves cross-dataset generalization while reducing the dependency on labeled data,demonstrating the effectiveness of the proposed framework. △ Less

Submitted 21 August, 2025; v1 submitted 12 August, 2025; originally announced August 2025.

Comments: Acccepted by European Conference on Artificial Intelligence(ECAI 2025)

arXiv:2508.13979 [pdf, ps, other]

AutoScale: Linear Scalarization Guided by Multi-Task Optimization Metrics

Authors: Yi Yang, Kei Ikemura, Qingwen Zhang, Xiaomeng Zhu, Ci Li, Nazre Batool, Sina Sharif Mansouri, John Folkesson

Abstract: Recent multi-task learning studies suggest that linear scalarization, when using well-chosen fixed task weights, can achieve comparable to or even better performance than complex multi-task optimization (MTO) methods. It remains unclear why certain weights yield optimal performance and how to determine these weights without relying on exhaustive hyperparameter search. This paper establishes a dire… ▽ More Recent multi-task learning studies suggest that linear scalarization, when using well-chosen fixed task weights, can achieve comparable to or even better performance than complex multi-task optimization (MTO) methods. It remains unclear why certain weights yield optimal performance and how to determine these weights without relying on exhaustive hyperparameter search. This paper establishes a direct connection between linear scalarization and MTO methods, revealing through extensive experiments that well-performing scalarization weights exhibit specific trends in key MTO metrics, such as high gradient magnitude similarity. Building on this insight, we introduce AutoScale, a simple yet effective two-phase framework that uses these MTO metrics to guide weight selection for linear scalarization, without expensive weight search. AutoScale consistently shows superior performance with high efficiency across diverse datasets including a new large-scale benchmark. △ Less

Submitted 19 August, 2025; originally announced August 2025.

Comments: The first two authors hold equal contribution. 10 pages, 6 figures

arXiv:2508.13515 [pdf, ps, other]

2D Gaussians Meet Visual Tokenizer

Authors: Yiang Shi, Xiaoyang Guo, Wei Yin, Mingkai Jia, Qian Zhang, Xiaolin Hu, Wenyu Liu, Xinggang Wang

Abstract: The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explored how to incorporate mo… ▽ More The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explored how to incorporate more visual information into the tokenizer and proposed a new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. In contrast, VGQ encodes image latents as 2D Gaussian distributions, effectively capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. Furthermore, by increasing the density of 2D Gaussians within the tokens, VGQ gains a significant boost in reconstruction capability and achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Codes will be released soon. △ Less

Submitted 19 August, 2025; v1 submitted 19 August, 2025; originally announced August 2025.

arXiv:2508.13156 [pdf, ps, other]

EvoVerilog: Large Langugage Model Assisted Evolution of Verilog Code

Authors: Ping Guo, Yiting Wang, Wanghao Ye, Yexiao He, Ziyao Wang, Xiaopeng Dai, Ang Li, Qingfu Zhang

Abstract: Large Language Models (LLMs) have demonstrated great potential in automating the generation of Verilog hardware description language code for hardware design. This automation is critical to reducing human effort in the complex and error-prone process of hardware design. However, existing approaches predominantly rely on human intervention and fine-tuning using curated datasets, limiting their sc… ▽ More Large Language Models (LLMs) have demonstrated great potential in automating the generation of Verilog hardware description language code for hardware design. This automation is critical to reducing human effort in the complex and error-prone process of hardware design. However, existing approaches predominantly rely on human intervention and fine-tuning using curated datasets, limiting their scalability in automated design workflows. Although recent iterative search techniques have emerged, they often fail to explore diverse design solutions and may underperform simpler approaches such as repeated prompting. To address these limitations, we introduce EvoVerilog, a novel framework that combines the reasoning capabilities of LLMs with evolutionary algorithms to automatically generate and refine Verilog code. EvoVerilog utilizes a multiobjective, population-based search strategy to explore a wide range of design possibilities without requiring human intervention. Extensive experiments demonstrate that EvoVerilog achieves state-of-the-art performance, with pass@10 scores of 89.1 and 80.2 on the VerilogEval-Machine and VerilogEval-Human benchmarks, respectively. Furthermore, the framework showcases its ability to explore diverse designs by simultaneously generating a variety of functional Verilog code while optimizing resource utilization. △ Less

Submitted 26 June, 2025; originally announced August 2025.

arXiv:2508.12433 [pdf, ps, other]

ATLAS: A Self-Supervised and Cross-Stage Netlist Power Model for Fine-Grained Time-Based Layout Power Analysis

Authors: Wenkai Li, Yao Lu, Wenji Fang, Jing Wang, Qijun Zhang, Zhiyao Xie

Abstract: Accurate power prediction in VLSI design is crucial for effective power optimization, especially as designs get transformed from gate-level netlist to layout stages. However, traditional accurate power simulation requires time-consuming back-end processing and simulation steps, which significantly impede design optimization. To address this, we propose ATLAS, which can predict the ultimate time-ba… ▽ More Accurate power prediction in VLSI design is crucial for effective power optimization, especially as designs get transformed from gate-level netlist to layout stages. However, traditional accurate power simulation requires time-consuming back-end processing and simulation steps, which significantly impede design optimization. To address this, we propose ATLAS, which can predict the ultimate time-based layout power for any new design in the gate-level netlist. To the best of our knowledge, ATLAS is the first work that supports both time-based power simulation and general cross-design power modeling. It achieves such general time-based power modeling by proposing a new pre-training and fine-tuning paradigm customized for circuit power. Targeting golden per-cycle layout power from commercial tools, our ATLAS achieves the mean absolute percentage error (MAPE) of only 0.58%, 0.45%, and 5.12% for the clock tree, register, and combinational power groups, respectively, without any layout information. Overall, the MAPE for the total power of the entire design is <1%, and the inference speed of a workload is significantly faster than the standard flow of commercial tools. △ Less

Submitted 17 August, 2025; originally announced August 2025.

Comments: Accepted by Design Automation Conference (DAC), 2025

arXiv:2508.12294 [pdf, ps, other]

AutoPower: Automated Few-Shot Architecture-Level Power Modeling by Power Group Decoupling

Authors: Qijun Zhang, Yao Lu, Mengming Li, Zhiyao Xie

Abstract: Power efficiency is a critical design objective in modern CPU design. Architects need a fast yet accurate architecture-level power evaluation tool to perform early-stage power estimation. However, traditional analytical architecture-level power models are inaccurate. The recently proposed machine learning (ML)-based architecture-level power model requires sufficient data from known configurations… ▽ More Power efficiency is a critical design objective in modern CPU design. Architects need a fast yet accurate architecture-level power evaluation tool to perform early-stage power estimation. However, traditional analytical architecture-level power models are inaccurate. The recently proposed machine learning (ML)-based architecture-level power model requires sufficient data from known configurations for training, making it unrealistic. In this work, we propose AutoPower targeting fully automated architecture-level power modeling with limited known design configurations. We have two key observations: (1) The clock and SRAM dominate the power consumption of the processor, and (2) The clock and SRAM power correlate with structural information available at the architecture level. Based on these two observations, we propose the power group decoupling in AutoPower. First, AutoPower decouples across power groups to build individual power models for each group. Second, AutoPower designs power models by further decoupling the model into multiple sub-models within each power group. In our experiments, AutoPower can achieve a low mean absolute percentage error (MAPE) of 4.36\% and a high $R^2$ of 0.96 even with only two known configurations for training. This is 5\% lower in MAPE and 0.09 higher in $R^2$ compared with McPAT-Calib, the representative ML-based power model. △ Less

Submitted 17 August, 2025; originally announced August 2025.

Comments: Published in DAC'25

arXiv:2508.12250 [pdf, ps, other]

WXSOD: A Benchmark for Robust Salient Object Detection in Adverse Weather Conditions

Authors: Quan Chen, Xiong Yang, Rongfeng Lu, Qianyu Zhang, Yu Liu, Xiaofei Zhou, Bolun Zheng

Abstract: Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies are concerned with the damage of weather noise on SOD performance due to the lack of dataset with pixel-wise annotati… ▽ More Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies are concerned with the damage of weather noise on SOD performance due to the lack of dataset with pixel-wise annotations. To bridge this gap, this paper introduces a novel Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of 14,945 RGB images with diverse weather noise, along with the corresponding ground truth annotations and weather labels. To verify algorithm generalization, WXSOD contains two test sets, i.e., a synthesized test set and a real test set. The former is generated by adding weather noise to clean images, while the latter contains real-world weather noise. Based on WXSOD, we propose an efficient baseline, termed Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture. Specifically, the weather prediction branch mines weather-related deep features, while the saliency detection branch fuses semantic features extracted from the backbone with weather features for SOD. Comprehensive comparisons against 17 SOD methods shows that our WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at https://github.com/C-water/WXSOD △ Less

Submitted 17 August, 2025; originally announced August 2025.

Comments: Under review

arXiv:2508.11442 [pdf, ps, other]

CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity

Authors: Bowen Zhang, Zixin Song, Chunquan Chen, Qian-Wen Zhang, Di Yin, Xing Sun

Abstract: Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-trainin… ▽ More Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter's deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space. △ Less

Submitted 15 August, 2025; originally announced August 2025.

arXiv:2508.11289 [pdf, ps, other]

A Recursive Total Least Squares Solution for Bearing-Only Target Motion Analysis and Circumnavigation

Authors: Lin Li, Xueming Liu, Zhoujingzi Qiu, Tianjiang Hu, Qingrui Zhang

Abstract: Bearing-only Target Motion Analysis (TMA) is a promising technique for passive tracking in various applications as a bearing angle is easy to measure. Despite its advantages, bearing-only TMA is challenging due to the nonlinearity of the bearing measurement model and the lack of range information, which impairs observability and estimator convergence. This paper addresses these issues by proposing… ▽ More Bearing-only Target Motion Analysis (TMA) is a promising technique for passive tracking in various applications as a bearing angle is easy to measure. Despite its advantages, bearing-only TMA is challenging due to the nonlinearity of the bearing measurement model and the lack of range information, which impairs observability and estimator convergence. This paper addresses these issues by proposing a Recursive Total Least Squares (RTLS) method for online target localization and tracking using mobile observers. The RTLS approach, inspired by previous results on Total Least Squares (TLS), mitigates biases in position estimation and improves computational efficiency compared to pseudo-linear Kalman filter (PLKF) methods. Additionally, we propose a circumnavigation controller to enhance system observability and estimator convergence by guiding the mobile observer in orbit around the target. Extensive simulations and experiments are performed to demonstrate the effectiveness and robustness of the proposed method. The proposed algorithm is also compared with the state-of-the-art approaches, which confirms its superior performance in terms of both accuracy and stability. △ Less

Submitted 15 August, 2025; originally announced August 2025.

Comments: Accepted by 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 6 Pages

arXiv:2508.11279 [pdf, ps, other]

Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble

Authors: Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng

Abstract: Spiking Neural Networks (SNNs) offer a promising direction for energy-efficient and brain-inspired computing, yet their vulnerability to adversarial perturbations remains poorly understood. In this work, we revisit the adversarial robustness of SNNs through the lens of temporal ensembling, treating the network as a collection of evolving sub-networks across discrete timesteps. This formulation unc… ▽ More Spiking Neural Networks (SNNs) offer a promising direction for energy-efficient and brain-inspired computing, yet their vulnerability to adversarial perturbations remains poorly understood. In this work, we revisit the adversarial robustness of SNNs through the lens of temporal ensembling, treating the network as a collection of evolving sub-networks across discrete timesteps. This formulation uncovers two critical but underexplored challenges-the fragility of individual temporal sub-networks and the tendency for adversarial vulnerabilities to transfer across time. To overcome these limitations, we propose Robust Temporal self-Ensemble (RTE), a training framework that improves the robustness of each sub-network while reducing the temporal transferability of adversarial perturbations. RTE integrates both objectives into a unified loss and employs a stochastic sampling strategy for efficient optimization. Extensive experiments across multiple benchmarks demonstrate that RTE consistently outperforms existing training methods in robust-accuracy trade-off. Additional analyses reveal that RTE reshapes the internal robustness landscape of SNNs, leading to more resilient and temporally diversified decision boundaries. Our study highlights the importance of temporal structure in adversarial learning and offers a principled foundation for building robust spiking models. △ Less

Submitted 15 August, 2025; originally announced August 2025.

Showing 1–50 of 2,705 results for author: zhang, Q