Search | arXiv e-print repository

ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving

Authors: Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, Chunjing Xu, Qiang Xu, Huchuan Lu, Dit-Yan Yeung

Abstract: In this paper, we present details of the 1st W-CODA workshop, held in conjunction with the ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. 5 Speakers from both academia and industry are invited to share their latest progress and opinions. We collect research papers and hold a dual-track challenge, including both corner case scene understanding and generation. As the pioneering effort, we will continuously bridge the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving agents robust towards corner cases.

Submitted 2 July, 2025; originally announced July 2025.

Comments: ECCV 2024. Workshop page: https://coda-dataset.github.io/w-coda2024/

arXiv:2507.01367

3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation

Authors: Tianrui Lou, Xiaojun Jia, Siyuan Liang, Jiawei Liang, Ming Zhang, Yanjun Xiao, Xiaochun Cao

Abstract: Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. Camouflage-based physical attack is a more promising approach compared to the patch-based attack, offering stronger adversarial effectiveness in complex physical environments. However, most prior work relies on mesh priors of the target object and virtual environments constructed by simulators, which are time-consuming to obtain and inevitably differ from the real world. Moreover, due to the limitations of the backgrounds in training images, previous methods often fail to produce multi-view robust adversarial camouflage and tend to fall into sub-optimal solutions. Due to these reasons, prior work lacks adversarial effectiveness and robustness across diverse viewpoints and physical environments. We propose a physical attack framework based on 3D Gaussian Splatting (3DGS), named PGA, which provides rapid and precise reconstruction with few images, along with photo-realistic rendering capabilities. Our framework further enhances cross-view robustness and adversarial effectiveness by preventing mutual and self-occlusion among Gaussians and employing a min-max optimization approach that adjusts the imaging background of each viewpoint, helping the algorithm filter out non-robust adversarial features. Extensive experiments validate the effectiveness and superiority of PGA. Our code is available at: https://github.com/TRLou/PGA.

Submitted 2 July, 2025; originally announced July 2025.

Comments: Accepted by ICCV 2025

arXiv:2506.22494

DriveBLIP2: Attention-Guided Explanation Generation for Complex Driving Scenarios

Authors: Shihong Ling, Yue Wan, Xiaowei Jia, Na Du

Abstract: This paper introduces a new framework, DriveBLIP2, built upon the BLIP2-OPT architecture, to generate accurate and contextually relevant explanations for emerging driving scenarios. While existing vision-language models perform well in general tasks, they encounter difficulties in understanding complex, multi-object environments, particularly in real-time applications such as autonomous driving, where the rapid identification of key objects is crucial. To address this limitation, an Attention Map Generator is proposed to highlight significant objects relevant to driving decisions within critical video frames. By directing the model's focus to these key regions, the generated attention map helps produce clear and relevant explanations, enabling drivers to better understand the vehicle's decision-making process in critical situations. Evaluations on the DRAMA dataset reveal significant improvements in explanation quality, as indicated by higher BLEU, ROUGE, CIDEr, and SPICE scores compared to baseline models. These findings underscore the potential of targeted attention mechanisms in vision-language models for enhancing explainability in real-time autonomous driving.

Submitted 24 June, 2025; originally announced June 2025.

Comments: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025. 7 pages, 3 figures

arXiv:2506.21618

TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge

Authors: Zhiyuan Zhang, Xiaosong Jia, Guanyu Chen, Qifeng Li, Junchi Yan

Abstract: In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and reach a superior performance with realism score of 0.7852 on the Waymo Open Sim Agents Challenge 2025. We will open-source the code in the future.

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.19266

Convergent and divergent connectivity patterns of the arcuate fasciculus in macaques and humans

Authors: Jiahao Huang, Ruifeng Li, Wenwen Yu, Anan Li, Xiangning Li, Mingchao Yan, Lei Xie, Qingrun Zeng, Xueyan Jia, Shuxin Wang, Ronghui Ju, Feng Chen, Qingming Luo, Hui Gong, Andrew Zalesky, Xiaoquan Yang, Yuanjing Feng, Zheng Wang

Abstract: The organization and connectivity of the arcuate fasciculus (AF) in nonhuman primates remain contentious, especially concerning how its anatomy diverges from that of humans. Here, we combined cross-scale single-neuron tracing - using viral-based genetic labeling and fluorescence micro-optical sectioning tomography in macaques (n = 4; age 3 - 11 years) - with whole-brain tractography from 11.7T diffusion MRI. Complemented by spectral embedding analysis of 7.0T MRI in humans, we performed a comparative connectomic analysis of the AF across species. We demonstrate that the macaque AF originates in the temporal-parietal cortex, traverses the auditory cortex and parietal operculum, and projects into prefrontal regions. In contrast, the human AF exhibits greater expansion into the middle temporal gyrus and stronger prefrontal and parietal operculum connectivity - divergences quantified by Kullback-Leibler analysis that likely underpin the evolutionary specialization of human language networks. These interspecies differences - particularly the human AF's broader temporal integration and strengthened frontoparietal linkages - suggest a connectivity-based substrate for the emergence of advanced language processing unique to humans. Furthermore, our findings offer a neuroanatomical framework for understanding AF-related disorders such as aphasia and dyslexia, where aberrant connectivity disrupts language function.

Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

Comments: 34 pages, 6 figures

arXiv:2506.17622

SoK: Stablecoin Designs, Risks, and the Stablecoin LEGO

Authors: Shengchen Ling, Yuefeng Du, Yajin Zhou, Lei Wu, Cong Wang, Xiaohua Jia, Houmin Yan

Abstract: Stablecoins have become significant assets in modern finance, with a market capitalization exceeding USD 246 billion (May 2025). Yet, despite their systemic importance, a comprehensive and risk-oriented understanding of crucial aspects like their design trade-offs, security dynamics, and interdependent failure pathways often remains underdeveloped. This SoK confronts this gap through a large-scale analysis of 157 research studies, 95 active stablecoins, and 44 major security incidents. Our analysis establishes four pivotal insights: 1) stability is best understood not as an inherent property but as an emergent, fragile state reliant on the interplay between market confidence and continuous liquidity; 2) stablecoin designs demonstrate trade-offs in risk specialization instead of mitigation; 3) the widespread integration of yield mechanisms imposes a "dual mandate" that creates a systemic tension between the core mission of stability and the high-risk financial engineering required for competitive returns; and 4) major security incidents act as acute "evolutionary pressures", forging resilience by stress-testing designs and aggressively redefining the security frontier. We introduce the Stablecoin LEGO framework, a quantitative methodology mapping historical failures to current designs. Its application reveals that a lower assessed risk strongly correlates with integrating lessons from past incidents. We hope this provides a systematic foundation for building, evaluating, and regulating more resilient stablecoins.

Submitted 21 June, 2025; originally announced June 2025.

arXiv:2506.17450

BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing

Authors: Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo

Abstract: We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.

Submitted 25 June, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

Comments: Project page: https://blenderfusion.github.io

arXiv:2506.16576

Accelerating Correlated Wave Function Calculations with Hierarchical Matrix Compression of the Two-Electron Integrals

Authors: Hongji Gao, Xiangmin Jiao, Benjamin G. Levine

Abstract: Leveraging matrix sparsity has proven a fruitful strategy for accelerating quantum chemical calculations. Here we present the hierarchical SOS-MP2 algorithm, which uses hierarchical matrix ($\mathcal{H}^{2}$) compression of the electron repulsion integral (ERI) tensor to reduce both time and space complexity. This approach is based on the atomic orbital Laplace transform MP2 calculations, leveraging the data sparsity of the ERI tensor and the element-wise sparsity of the energy-weighted density matrices. The $\mathcal{H}^{2}$ representation approximates the ERI tensor in a block low-rank form, taking advantage of the inherent low-rank nature of the repulsion integrals between distant sets of atoms. The resulting algorithm enables the calculation of the Coulomb-like term of the MP2 energy with a theoretical time complexity of $\mathcal{O}(N^{2}\log N)$ and a space complexity of $\mathcal{O}(N^{2}\log N)$, where $N$ denotes the number of basis functions. Numerical tests show asymptotic time and space complexities better than $\mathcal{O}(N^{2})$ for both linear alkanes and three-dimensional water clusters.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.14476

doi 10.1145/3711066

SimSpark: Interactive Simulation of Social Media Behaviors

Authors: Ziyue Lin, Yi Shan, Lin Gao, Xinghua Jia, Siming Chen

Abstract: Understanding user behaviors on social media has garnered significant scholarly attention, enhancing our comprehension of how virtual platforms impact society and empowering decision-makers. Simulating social media behaviors provides a robust tool for capturing the patterns of social media behaviors, testing hypotheses, and predicting the effects of various interventions, ultimately contributing to a deeper understanding of social media environments. Moreover, it can overcome difficulties associated with utilizing real data for analysis, such as data accessibility issues, ethical concerns, and the complexity of processing large and heterogeneous datasets. However, researchers and stakeholders need more flexible platforms to investigate different user behaviors by simulating different scenarios and characters, which is not possible yet. Therefore, this paper introduces SimSpark, an interactive system including simulation algorithms and interactive visual interfaces which is capable of creating small simulated social media platforms with customizable characters and social environments. We address three key challenges: generating believable behaviors, validating simulation results, and supporting interactive control for generation and results analysis. A simulation workflow is introduced to generate believable behaviors of agents by utilizing large language models. A visual interface enables real-time parameter adjustment and process monitoring for customizing generation settings.
A set of visualizations and interactions are also designed to display the models' outputs for further analysis. Effectiveness is evaluated through case studies, quantitative simulation model assessments, and expert interviews.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 32 pages, 7 figures

Journal ref: Proc. ACM Hum.-Comput. Interact. 9, 2, Article CSCW168 (April 2025), 32 pages

arXiv:2506.12301

Unveiling Confirmation Bias in Chain-of-Thought Reasoning

Authors: Yue Wan, Xiaowei Jia, Xiang Lorraine Li

Abstract: Chain-of-thought (CoT) prompting has been widely adopted to enhance the reasoning capabilities of large language models (LLMs). However, the effectiveness of CoT reasoning is inconsistent across tasks with different reasoning types. This work presents a novel perspective to understand CoT behavior through the lens of \textit{confirmation bias} in cognitive psychology. Specifically, we examine how model internal beliefs, approximated by direct question-answering probabilities, affect both reasoning generation ($Q \to R$) and reasoning-guided answer prediction ($QR \to A$) in CoT. By decomposing CoT into a two-stage process, we conduct a thorough correlation analysis in model beliefs, rationale attributes, and stage-wise performance. Our results provide strong evidence of confirmation bias in LLMs, such that model beliefs not only skew the reasoning process but also influence how rationales are utilized for answer prediction. Furthermore, the interplay between task vulnerability to confirmation bias and the strength of beliefs also provides explanations for CoT effectiveness across reasoning tasks and models. Overall, this study provides a valuable insight for the needs of better prompting strategies that mitigate confirmation bias to enhance reasoning performance. Code is available at \textit{https://github.com/yuewan2/biasedcot}.

Submitted 13 June, 2025; originally announced June 2025.

Journal ref: ACL 2025 Findings

arXiv:2506.09981

ReSim: Reliable World Simulation for Autonomous Driving

Authors: Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, Li Chen

Abstract: How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

Submitted 11 June, 2025; originally announced June 2025.

Comments: Project page: https://opendrivelab.com/ReSim

arXiv:2506.08473

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Authors: Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan

Abstract: Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or harmless data can compromise safeguards. In this paper, building on the concept of alignment direction -- defined by the weight difference between aligned and unaligned models -- we observe that perturbations along this direction preserve model safety. In contrast, perturbations along directions orthogonal to this alignment are strongly linked to harmful direction perturbations, rapidly degrading safety and framing the parameter space as a narrow safety basin. Based on this insight, we propose a methodology for safety fine-tuning called AsFT (Anchoring Safety in Fine-Tuning), which integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates in harmful directions, ensuring that fine-tuning is constrained within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60 percent, improving model performance by 3.44 percent, and maintaining robust performance across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT

Submitted 10 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.07672

MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents

Authors: Yunhe Yan, Shihe Wang, Jiajun Du, Yexuan Yang, Yuxuan Shan, Qichen Qiu, Xianqing Jia, Xinge Wang, Xin Yuan, Xu Han, Mao Qin, Yinxiao Chen, Chen Peng, Shangguang Wang, Mengwei Xu

Abstract: (M)LLM-powered computer use agents (CUA) are emerging as a transformative technique to automate human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents, whose evaluation methods are susceptible to UI changes and ignore function interactions exposed by application APIs, e.g., Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., those with source code availability and can be revised/re-compiled as needed (e.g., adding MCP support), with two notable advantages: (1) It greatly broadens the design space of CUA, such as what and how the app features to be exposed/extracted as CUA-callable APIs. (2) It allows MCPWorld to programmatically verify task completion by directly monitoring application behavior through techniques like dynamic code instrumentation, offering robust, accurate CUA evaluation decoupled from specific agent implementations or UI states. Currently, MCPWorld includes 201 well curated and annotated user tasks, covering diversified use cases and difficulty levels. MCPWorld is also fully containerized with GPU acceleration support for flexible adoption on different OS/hardware environments. Our preliminary experiments, using a representative LLM-powered CUA framework, achieve 75.12% task completion accuracy, simultaneously providing initial evidence on the practical effectiveness of agent automation leveraging MCP.
Overall, we anticipate MCPWorld to facilitate and standardize the benchmarking of next-generation computer use agents that can leverage rich external tools. Our code and dataset are publicly available at https://github.com/SAAgent/MCPWorld.

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07454

Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs

Authors: Jared Strader, Aaron Ray, Jacob Arkin, Mason B. Peterson, Yun Chang, Nathan Hughes, Christopher Bradley, Yi Xuan Jia, Carlos Nieto-Granda, Rajat Talak, Chuchu Fan, Luca Carlone, Jonathan P. How, Nicholas Roy

Abstract: In this paper, we introduce a multi-robot system that integrates mapping, localization, and task and motion planning (TAMP) enabled by 3D scene graphs to execute complex instructions expressed in natural language. Our system builds a shared 3D scene graph incorporating an open-set object-based map, which is leveraged for multi-robot 3D scene graph fusion. This representation supports real-time, view-invariant relocalization (via the object-based map) and planning (via the 3D scene graph), allowing a team of robots to reason about their surroundings and execute complex tasks. Additionally, we introduce a planning approach that translates operator intent into Planning Domain Definition Language (PDDL) goals using a Large Language Model (LLM) by leveraging context from the shared 3D scene graph and robot capabilities. We provide an experimental assessment of the performance of our system on real-world tasks in large-scale, outdoor environments.

Submitted 9 June, 2025; originally announced June 2025.

Comments: 12 pages, 4 figures

arXiv:2506.06660

Efficient Mirror-type Kernels for the Metropolis-Hastings Algorithm

Authors: Nuo Guan, Xiyun Jiao

Abstract: We propose a new Metropolis-Hastings (MH) kernel by introducing the Mirror move into the Metropolis adjusted Langevin algorithm (MALA). This new kernel uses the strength of one kernel to overcome the shortcoming of the other, and generates proposals that are distant from the current position, but still within the high-density region of the target distribution. The resulting algorithm can be much more efficient than both Mirror and MALA, while staying comparable in terms of computational cost. We demonstrate the advantages of the MirrorMALA kernel using a variety of one-dimensional and multi-dimensional examples. The Mirror and MirrorMALA are both special cases of the Mirror-type kernels, a new suite of efficient MH proposals. We use the Mirror-type kernels, together with a novel method of doing the whitening transformation on high-dimensional random variables, which was inspired by Tan and Nott, to analyse the Bayesian generalized linear mixed models (GLMMs), and obtain the per-time-unit efficiency that is 2--20 times higher than the HMC or NUTS algorithm.

Submitted 7 June, 2025; originally announced June 2025.
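
Reader's note on the entry above: the abstract builds on the Metropolis adjusted Langevin algorithm (MALA). For orientation only, here is a minimal sketch of one plain MALA step in Python with NumPy. It is the textbook baseline, not the MirrorMALA kernel the paper proposes (the listing does not specify the Mirror move), and the names `mala_step`, `run_chain`, `log_pi`, `grad_log_pi` and `step` are hypothetical.

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, step, rng):
    """One textbook MALA step (baseline only; NOT the paper's MirrorMALA)."""
    # Langevin proposal: drift along the gradient of the log target, plus noise.
    mean_fwd = x + 0.5 * step**2 * grad_log_pi(x)
    prop = mean_fwd + step * rng.standard_normal(x.shape)
    # Mean of the reverse proposal, needed for the Metropolis-Hastings ratio.
    mean_bwd = prop + 0.5 * step**2 * grad_log_pi(prop)
    log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (2 * step**2)
    log_q_bwd = -np.sum((x - mean_bwd) ** 2) / (2 * step**2)
    log_alpha = log_pi(prop) - log_pi(x) + log_q_bwd - log_q_fwd
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return x, False

def run_chain(n_steps, dim=2, step=1.0, seed=0):
    """Sample a standard normal target (an illustrative choice) with MALA."""
    rng = np.random.default_rng(seed)
    log_pi = lambda x: -0.5 * np.sum(x**2)   # log density up to a constant
    grad_log_pi = lambda x: -x
    x = np.zeros(dim)
    samples = np.empty((n_steps, dim))
    accepted = 0
    for i in range(n_steps):
        x, ok = mala_step(x, log_pi, grad_log_pi, step, rng)
        samples[i] = x
        accepted += ok
    return samples, accepted / n_steps
```

Calling `run_chain(5000)` returns the samples and the empirical acceptance rate; the step size is a tuning parameter chosen here purely for illustration.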

arXiv:2506.06599

Direct Prediction Set Minimization via Bilevel Conformal Classifier Training

Authors: Yuanjie Shi, Hooman Shahrokhi, Xuesong Jia, Xiongzhi Chen, Janardhan Rao Doppa, Yan Yan

Abstract: Conformal prediction (CP) is a promising uncertainty quantification framework which works as a wrapper around a black-box classifier to construct prediction sets (i.e., subset of candidate classes) with provable guarantees. However, standard calibration methods for CP tend to produce large prediction sets which makes them less useful in practice. This paper considers the problem of integrating conformal principles into the training process of deep classifiers to directly minimize the size of prediction sets. We formulate conformal training as a bilevel optimization problem and propose the {\em Direct Prediction Set Minimization (DPSM)} algorithm to solve it. The key insight behind DPSM is to minimize a measure of the prediction set size (upper level) that is conditioned on the learned quantile of conformity scores (lower level). We analyze that DPSM has a learning bound of $O(1/\sqrt{n})$ (with $n$ training samples), while prior conformal training methods based on stochastic approximation for the quantile has a bound of $Ω(1/s)$ (with batch size $s$ and typically $s \ll \sqrt{n}$). Experiments on various benchmark datasets and deep models show that DPSM significantly outperforms the best prior conformal training baseline with $20.46\%\downarrow$ in the prediction set size and validates our theory.

Submitted 6 June, 2025; originally announced June 2025.

Comments: Accepted for Publication at International Conference on Machine Learning (ICML), 2025
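
Reader's note on the entry above: the abstract describes conformal prediction as a wrapper that turns a black-box classifier's outputs into prediction sets using a quantile of conformity scores. The sketch below illustrates standard split conformal calibration only, assuming softmax probabilities are already available. It is not the DPSM bilevel training algorithm, and the function names and the score choice (one minus the true-class probability) are illustrative assumptions.

```python
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold on held-out data (split conformal)."""
    n = len(cal_labels)
    # Nonconformity score: one minus the probability given to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_probs, qhat):
    """Keep every class whose score does not exceed the calibrated threshold."""
    return [np.flatnonzero(1.0 - p <= qhat) for p in test_probs]

# Toy calibration data: 4 examples, 3 classes (made-up numbers).
cal_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.10, 0.70],
    [0.30, 0.50, 0.20],
])
cal_labels = np.array([0, 1, 2, 0])
qhat = conformal_quantile(cal_probs, cal_labels, alpha=0.5)

test_probs = np.array([[0.50, 0.40, 0.10], [0.90, 0.05, 0.05]])
sets = prediction_sets(test_probs, qhat)
```

A larger threshold `qhat` admits more classes per set, which is why the size of the prediction sets depends on how well the classifier's scores are calibrated; the paper's contribution is to train the classifier so that this size shrinks.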

arXiv:2506.06072

BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning

Authors: Hongyi Zhou, Weiran Liao, Xi Huang, Yucheng Tang, Fabian Otto, Xiaogang Jia, Xinkai Jiang, Simon Hilber, Ge Li, Qian Wang, Ömer Erdinç Yağmurlu, Nils Blank, Moritz Reuss, Rudolf Lioutikov

Abstract: We present the B-spline Encoded Action Sequence Tokenizer (BEAST), a novel action tokenizer that encodes action sequences into compact discrete or continuous tokens using B-splines. In contrast to existing action tokenizers based on vector quantization or byte pair encoding, BEAST requires no separate tokenizer training and consistently produces tokens of uniform length, enabling fast action sequence generation via parallel decoding. Leveraging our B-spline formulation, BEAST inherently ensures generating smooth trajectories without discontinuities between adjacent segments. We extensively evaluate BEAST by integrating it with three distinct model architectures: a Variational Autoencoder (VAE) with continuous tokens, a decoder-only Transformer with discrete tokens, and Florence-2, a pretrained Vision-Language Model with an encoder-decoder architecture, demonstrating BEAST's compatibility and scalability with large pretrained models. We evaluate BEAST across three established benchmarks consisting of 166 simulated tasks and on three distinct robot settings with a total of 8 real-world tasks. Experimental results demonstrate that BEAST (i) significantly reduces both training and inference computational costs, and (ii) consistently generates smooth, high-frequency control signals suitable for continuous control tasks while (iii) reliably achieves competitive task success rates compared to state-of-the-art methods.

Submitted 10 June, 2025; v1 submitted 6 June, 2025; originally announced June 2025.

arXiv:2506.05401 [pdf, ps, other]

Robust Anti-Backdoor Instruction Tuning in LVLMs

Authors: Yuan Xun, Siyuan Liang, Xiaojun Jia, Xinwei Liu, Xiaochun Cao

Abstract: Large visual language models (LVLMs) have demonstrated excellent instruction-following capabilities, yet remain vulnerable to stealthy backdoor attacks when finetuned using contaminated data. Existing backdoor defense techniques are usually developed for single-modal visual or language models under fully parameter-adjustable settings or rely on supervisory knowledge during training. However, in real-world scenarios, defenders cannot modify frozen visual encoders or core LLM parameters, nor possess prior knowledge of unknown trigger patterns or target responses. Motivated by the empirical finding that LVLMs readily overfit to fixed, unknown triggers, which can embed malicious associations during adapter-level tuning, we aim to design a defense that operates without access to core weights or attack priors. To this end, we introduce a lightweight, certified-agnostic defense framework, Robust Instruction Tuning, that finetunes only adapter modules and text embedding layers under instruction tuning. Our method integrates two complementary regularizations: (1) Input Diversity Regularization, which perturbs trigger components across training samples to disrupt consistent spurious cues; and (2) Anomalous Activation Regularization, which dynamically sparsifies adapter weights exhibiting abnormally sharp activations linked to backdoor patterns. These mechanisms jointly guide the model toward learning semantically grounded representations rather than memorizing superficial trigger-response mappings. Extensive experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces their attack success rate to nearly zero, with an increase in training cost of less than 15%.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.05055 [pdf, ps, other]

Study of $f_1(1420)$ and $η(1405)$ in the decay $J/ψ\to γπ^{0}π^{0}π^{0}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (650 additional authors not shown)

Abstract: A partial-wave analysis is performed on the decay $J/ψ\toγπ^{0}π^{0}π^{0}$ within the $π^{0}π^{0}π^{0}$ invariant-mass region below 1.6 GeV$/c^{2}$, using $(10.09~\pm~0.04)\times10^{9} ~J/ψ$ events collected with the BESIII detector. Significant isospin-violating decays of $η(1405)$ and $f_1(1420)$ into $f_0(980)π^{0}$ are observed. For the first time, three axial-vectors, $f_1(1285)$, $f_1(1420)$ and $f_1(1510)$, are observed to decay into $π^{0}π^{0}π^{0}$. The product branching fractions of these resonances are reported.

Submitted 7 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.02555 [pdf, other]

SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

Authors: Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, Erli Zhang, Junde Wu, Jiaan Zhang, Yuxuan Wang, Chang Han Low, Jian Jiang, Zilong Zheng, Xiaochun Cao, Yutong Ban, Qi Dou, Yang Liu, Yueming Jin

Abstract: Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, where this single universal model can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, followed by standardizing labels and performing hierarchical vision-language alignment to facilitate comprehensive coverage of gradually finer-grained surgical tasks, from visual perception and temporal analysis to high-level reasoning. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely used datasets in the surgical domain, covering several crucial downstream tasks. Based on SurgVLM-Bench, we evaluate the performance of our SurgVLM (3 SurgVLM variants: SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B), and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max).

Submitted 3 June, 2025; originally announced June 2025.

Comments: 29 pages, 5 figures

MSC Class: 68T45 ACM Class: I.2.10

arXiv:2505.22013 [pdf, other]

Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge

Authors: Shangkun Huang, Yuxuan Du, Jingwen Yang, Dejun Zhang, Xupeng Jia, Jing Deng, Jintao Kang, Rong Zheng

Abstract: This paper presents the system developed to address the MISP 2025 Challenge. For the diarization system, we proposed a hybrid approach combining a WavLM end-to-end segmentation method with a traditional multi-module clustering technique to adaptively select the appropriate model for handling varying degrees of overlapping speech. For the automatic speech recognition (ASR) system, we proposed an ASR-aware observation addition method that compensates for the performance limitations of Guided Source Separation (GSS) under low signal-to-noise ratio conditions. Finally, we integrated the speaker diarization and ASR systems in a cascaded architecture to address Track 3. Our system achieved a character error rate (CER) of 9.48% on Track 2 and a concatenated minimum-permutation character error rate (cpCER) of 11.56% on Track 3, ultimately securing first place in both tracks and thereby demonstrating the effectiveness of the proposed methods in real-world meeting scenarios.
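The character error rate quoted in this abstract is conventionally defined as the Levenshtein edit distance between the reference and hypothesis transcripts, divided by the reference length. The sketch below shows that conventional definition only. The helper names are illustrative, and the speaker-permutation step needed for cpCER is deliberately left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abxd"))  # one substitution over four reference characters
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.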

Submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted to Interspeech 2025

arXiv:2505.21773 [pdf, ps, other]

Assessing EV Charging Impacts on Power Distribution Systems: A Unified Co-Simulation Framework

Authors: Mohammadreza Iranpour, Mohammad Rasoul Narimani, Xudong Jia

Abstract: The growing adoption of electric vehicles (EVs) is expected to significantly increase demand on electric power distribution systems, many of which are already nearing capacity. To address this, the paper presents a comprehensive framework for analyzing the impact of large-scale EV integration on distribution networks. Using the open-source simulator OpenDSS, the framework builds detailed, scalable models of electric distribution systems, incorporating high-fidelity synthetic data from the SMART-DS project. The study models three feeders from an urban substation in San Francisco down to the household level. A key contribution is the framework's ability to identify critical system components likely to require upgrades due to increased EV loads. It also incorporates advanced geospatial visualization through QGIS, which aids in understanding how charging demands affect specific grid areas, helping stakeholders target infrastructure reinforcements. To ensure realistic load modeling, the framework uses EV load profiles based on U.S. Department of Energy projections, factoring in vehicle types, charging behaviors, usage patterns, and adoption rates. By leveraging large-scale synthetic data, the model remains relevant for real-world utility planning. It supports diverse simulation scenarios, from light to heavy EV charging loads and distributed vs. centralized charging patterns, offering a practical planning tool for utilities and policymakers. Additionally, its modular design enables easy adaptation to different geographic regions, feeder setups, and adoption scenarios, making it suitable for future studies on evolving grid conditions.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.21499 [pdf, ps, other]

AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery

Authors: Haowei Wang, Junjie Wang, Xiaojun Jia, Rupeng Zhang, Mingyang Li, Zhe Liu, Yang Liu, Qing Wang

Abstract: Vision-Language Model (VLM) based Web Agents represent a significant step towards automating complex tasks by simulating human-like interaction with websites. However, their deployment in uncontrolled web environments introduces significant security vulnerabilities. Existing research on adversarial environmental injection attacks often relies on unrealistic assumptions, such as direct HTML manipulation, knowledge of user intent, or access to agent model parameters, limiting their practical applicability. In this paper, we propose AdInject, a novel and real-world black-box attack method that leverages internet advertising delivery to inject malicious content into the Web Agent's environment. AdInject operates under a significantly more realistic threat model than prior work, assuming a black-box agent, static malicious content constraints, and no specific knowledge of user intent. AdInject includes strategies for designing malicious ad content aimed at misleading agents into clicking, and a VLM-based ad content optimization technique that infers potential user intents from the target website's context and integrates these intents into the ad content to make it appear more relevant or critical to the agent's task, thus enhancing attack effectiveness. Experimental evaluations demonstrate the effectiveness of AdInject, with attack success rates exceeding 60% in most scenarios and approaching 100% in certain cases. This strongly demonstrates that prevalent advertising delivery constitutes a potent and real-world vector for environment injection attacks against Web Agents. This work highlights a critical vulnerability in Web Agent security arising from real-world environment manipulation channels, underscoring the urgent need for developing robust defense mechanisms against such threats. Our code is available at https://github.com/NicerWang/AdInject.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.21494 [pdf, ps, other]

Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

Authors: Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, Yang Liu

Abstract: Multimodal large language models (MLLMs) remain vulnerable to transferable adversarial examples. While existing methods typically achieve targeted attacks by aligning global features (such as CLIP's [CLS] token) between adversarial and target samples, they often overlook the rich local information encoded in patch tokens. This leads to suboptimal alignment and limited transferability, particularly for closed-source models. To address this limitation, we propose a targeted transferable adversarial attack method based on feature optimal alignment, called FOA-Attack, to improve adversarial transfer capability. Specifically, at the global level, we introduce a global feature loss based on cosine similarity to align the coarse-grained features of adversarial samples with those of target samples. At the local level, given the rich local representations within Transformers, we leverage clustering techniques to extract compact local patterns to alleviate redundant local features. We then formulate local feature alignment between adversarial and target samples as an optimal transport (OT) problem and propose a local clustering optimal transport loss to refine fine-grained feature alignment. Additionally, we propose a dynamic ensemble model weighting strategy to adaptively balance the influence of multiple models during adversarial example generation, thereby further improving transferability. Extensive experiments across various models demonstrate the superiority of the proposed method, outperforming state-of-the-art methods, especially in transferring to closed-source MLLMs. The code is released at https://github.com/jiaxiaojunQAQ/FOA-Attack.
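The abstract mentions a global feature loss based on cosine similarity but does not give its exact form. A common way to write such a loss is one minus the cosine similarity of the two feature vectors, which is what the sketch below assumes. The names `cosine_similarity` and `global_feature_loss` and the toy vectors are illustrative and are not taken from the FOA-Attack code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def global_feature_loss(adv_feature, target_feature):
    """Assumed loss form: 0 when the features are aligned, 2 when opposite."""
    return 1.0 - cosine_similarity(adv_feature, target_feature)


# Toy global features standing in for [CLS]-style embeddings.
print(global_feature_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(global_feature_loss([1.0, 0.0], [0.0, 1.0]))
```

The loss depends only on direction, so rescaling either feature vector leaves it unchanged. An attack would minimize it with respect to the adversarial perturbation.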

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.20469 [pdf, other]

CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

Authors: Lei Tian, Xiaomin Li, Liqian Ma, Hefei Huang, Zirui Zheng, Hao Yin, Taiqing Li, Huchuan Lu, Xu Jia

Abstract: Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding, a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations. These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Specifically, our approach first employs a zero-shot tracker to align a set of SAM-generated 2D masks and reliably identify their corresponding categories. Next, we utilize CLIP to extract robust semantic encodings across views. Finally, our Contrastive Codebook Learning (CCL) module distills discriminative semantic features by enforcing intra-class compactness and inter-class distinctiveness. In contrast to previous methods that directly apply CLIP to imperfect masks, our framework explicitly resolves semantic conflicts while preserving category discriminability. Extensive experiments demonstrate that CCL-LGS outperforms previous state-of-the-art methods. Our project page is available at https://epsilontl.github.io/CCL-LGS/.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.19139 [pdf, ps, other]

The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework

Authors: Feiran Liu, Yuzhe Zhang, Xinyi Huang, Yinan Peng, Xinfeng Li, Lixu Wang, Yutong Shen, Ranjie Duan, Simeng Qin, Xiaojun Jia, Qingsong Wen, Wei Dong

Abstract: Our research reveals a new privacy risk associated with the vision-language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term "image private attribute profiling." This threat is particularly severe given that modern apps can easily access users' photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this work, we construct PAPI, the largest dataset for studying private attribute profiling in personal images, comprising 2,510 images from 251 individuals with 3,012 annotated privacy attributes. We also propose HolmesEye, a hybrid agentic framework that combines VLMs and LLMs to enhance privacy inference. HolmesEye uses VLMs to extract both intra-image and inter-image information and LLMs to guide the inference process as well as consolidate the results through forensic analysis, overcoming existing limitations in long-context visual reasoning. Experiments reveal that HolmesEye achieves a 10.8% improvement in average accuracy over state-of-the-art baselines and surpasses human-level performance by 15.0% in predicting abstract attributes. This work highlights the urgency of addressing privacy risks in image-based profiling and offers both a new dataset and an advanced framework to guide future research in this area.

Submitted 25 May, 2025; originally announced May 2025.

arXiv:2505.18954 [pdf, ps, other]

Efficient SRAM-PIM Co-design by Joint Exploration of Value-Level and Bit-Level Sparsity

Authors: Cenlin Duan, Jianlei Yang, Yikun Wang, Yiou Wang, Yingjie Qi, Xiaolin He, Bonan Yan, Xueyan Wang, Xiaotao Jia, Weisheng Zhao

Abstract: Processing-in-memory (PIM) is a transformative architectural paradigm designed to overcome the Von Neumann bottleneck. Among PIM architectures, digital SRAM-PIM emerges as a promising solution, offering significant advantages by directly integrating digital logic within the SRAM array. However, rigid crossbar architecture and full array activation pose challenges in efficiently utilizing traditional value-level sparsity. Moreover, neural network models exhibit a high proportion of zero bits within non-zero values, which remain underutilized due to architectural constraints. To overcome these limitations, we present Dyadic Block PIM (DB-PIM), a groundbreaking algorithm-architecture co-design framework to harness both value-level and bit-level sparsity. At the algorithm level, our hybrid-grained pruning technique, combined with a novel sparsity pattern, enables effective sparsity management. Architecturally, DB-PIM incorporates a sparse network and customized digital SRAM-PIM macros, including an input pre-processing unit (IPU), dyadic block multiply units (DBMUs), and Canonical Signed Digit (CSD)-based adder trees. It circumvents structured zero values in weights and bypasses unstructured zero bits within non-zero weights and block-wise all-zero bit columns in input features. As a result, the DB-PIM framework skips a majority of unnecessary computations, thereby driving significant gains in computational efficiency. Results demonstrate that our DB-PIM framework achieves up to 8.01x speedup and 85.28% energy savings, significantly boosting computational efficiency in digital SRAM-PIM systems.
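The abstract refers to Canonical Signed Digit (CSD) encoding as a way to expose bit-level sparsity. The sketch below shows the textbook CSD (non-adjacent form) recoding of a non-negative integer, under which many values need fewer non-zero digits than in plain binary. It illustrates the number representation only and is not the DB-PIM hardware design. The function names are illustrative.

```python
def to_csd(n):
    """CSD digits of a non-negative integer, least significant digit first.

    Each digit is -1, 0, or 1, and no two adjacent digits are both non-zero.
    """
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = []
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= digit           # clears the low bit (may carry upward)
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits


def from_csd(digits):
    """Inverse of to_csd: sum of digit * 2**position."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def nonzero_count(digits):
    return sum(1 for d in digits if d != 0)


value = 119  # binary 1110111 has six non-zero bits
csd = to_csd(value)
print(csd, nonzero_count(csd), bin(value).count("1"))
```

Here 119 is recoded as 128 - 8 - 1, so a multiplier driven by the CSD digits handles three non-zero terms instead of six.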

Submitted 12 June, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

Comments: This paper is accepted by the Journal of IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

arXiv:2505.18652 [pdf, ps, other]

Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU

Authors: Yicheng Lin, Yunlong Jiang, Xujia Jiao, Bin Han

Abstract: Robust long-term visual localization in complex industrial environments is critical for mobile robotic systems. Existing approaches face limitations: handcrafted features are illumination-sensitive, learned features are computationally intensive, and semantic- or marker-based methods are environmentally constrained. Handcrafted and learned features share similar representations but differ functionally. Handcrafted features are optimized for continuous tracking, while learned features excel in wide-baseline matching. Their complementarity calls for integration rather than replacement. Building on this, we propose a hierarchical localization framework. It leverages real-time handcrafted feature extraction for relative pose estimation. In parallel, it employs selective learned keypoint detection on optimized keyframes for absolute positioning. This design enables CPU-efficient, long-term visual localization. Experiments progress through three validation phases: first, establishing feature complementarity through comparative analysis; second, profiling computational latency across algorithm stages on CPU platforms; and finally, evaluating under photometric variations (including seasonal transitions and diurnal cycles), which demonstrates a 47% average error reduction with significantly improved localization consistency. The code implementation is publicly available at https://github.com/linyicheng1/ORB_SLAM3_localization.

Submitted 24 May, 2025; originally announced May 2025.

Comments: 8 pages, 6 figures

arXiv:2505.18355 [pdf, ps, other]

X-MethaneWet: A Cross-scale Global Wetland Methane Emission Benchmark Dataset for Advancing Science Discovery with AI

Authors: Yiming Sun, Shuo Chen, Shengyu Chen, Chonghao Qiu, Licheng Liu, Youmi Oh, Sparkle L. Malone, Gavin McNicol, Qianlai Zhuang, Chris Smith, Yiqun Xie, Xiaowei Jia

Abstract: Methane (CH$_4$) is the second most powerful greenhouse gas after carbon dioxide and plays a crucial role in climate change due to its high global warming potential. Accurately modeling CH$_4$ fluxes across the globe and at fine temporal scales is essential for understanding its spatial and temporal variability and developing effective mitigation strategies. In this work, we introduce the first-of-its-kind cross-scale global wetland methane benchmark dataset (X-MethaneWet), which synthesizes physics-based model simulation data from TEM-MDM and the real-world observation data from FLUXNET-CH$_4$. This dataset can offer opportunities for improving global wetland CH$_4$ modeling and science discovery with new AI algorithms. To set up AI model baselines for methane flux prediction, we evaluate the performance of various sequential deep learning models on X-MethaneWet. Furthermore, we explore four different transfer learning techniques to leverage simulated data from TEM-MDM to improve the generalization of deep learning models on real-world FLUXNET-CH$_4$ observations. Our extensive experiments demonstrate the effectiveness of these approaches, highlighting their potential for advancing methane emission modeling and contributing to the development of more accurate and scalable AI-driven climate models.

Submitted 23 May, 2025; originally announced May 2025.

Comments: 8 pages, 8 figures, 3 tables

arXiv:2505.18004 [pdf, ps, other]

Measurement of branching fractions of $Λ_{c}^{+}$ decays to $Σ^{+} η$ and $Σ^{+} η'$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (644 additional authors not shown)

Abstract: By analyzing $e^+e^-$ collision data taken at center-of-mass energies $\sqrt{s} = 4.600 \sim 4.699$ $\mbox{GeV}$ with the BESIII detector at the BEPCII collider, corresponding to an integrated luminosity of $\rm 4.5~fb^{-1}$, we study the hadronic decays $Λ_{c}^{+} \rightarrow Σ^{+} η$ and $Λ_{c}^{+} \rightarrow Σ^{+} η^{\prime}$ using the single-tag method. The branching fraction ratio of $Λ_{c}^+ \rightarrow Σ^+ η$ relative to $Λ_{c}^+ \rightarrow Σ^+ π^0$ is determined to be $0.305 \pm 0.046_{\rm stat.} \pm 0.007_{\rm sys.}$, and that of $Λ_{c}^+ \rightarrow Σ^+ η'$ relative to $Λ_{c}^+ \rightarrow Σ^+ ω$ is $0.336 \pm 0.094_{\rm stat.} \pm 0.037_{\rm sys.}$. The ratio of $\frac{\mathcal{B}\left(Λ_{c}^{+} \rightarrow Σ^{+} η'\right)}{\mathcal{B}\left(Λ_{c}^{+} \rightarrow Σ^{+} η\right)} $ is determined to be $1.50\pm 0.48 \pm 0.17 \pm 0.21$, where the uncertainties are statistical, systematic, and from $\mathcal{B}\left(Λ_{c}^{+} \rightarrow Σ^{+} π^0\right) $ or $\mathcal{B}\left(Λ_{c}^{+} \rightarrow Σ^{+} ω\right) $, respectively. These results enrich our knowledge of charmed baryon decays.

Submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.16394 [pdf, ps, other]

Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

Authors: Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, Junchi Yan

Abstract: Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end-to-end autonomous driving (E2E-AD) remains an open problem because of its training difficulty, and IL is still the mainstream paradigm in both academia and industry. Recently, Model-based Reinforcement Learning (MBRL) has demonstrated promising results in neural planning; however, these methods typically require privileged information as input rather than raw sensor data. We fill this gap by designing Raw2Drive, a dual-stream MBRL approach. Initially, we efficiently train an auxiliary privileged world model paired with a neural planner that uses privileged information as input. Subsequently, we introduce a raw sensor world model trained via our proposed Guidance Mechanism, which ensures consistency between the raw sensor world model and the privileged world model during rollouts. Finally, the raw sensor world model combines the prior knowledge embedded in the heads of the privileged world model to effectively guide the training of the raw sensor policy. Raw2Drive is so far the only RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state-of-the-art performance.

Submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.16278 [pdf, ps, other]

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Authors: Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

Abstract: End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add Vision MoE to Drive-$π_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from the mode averaging seen in existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.

Submitted 22 May, 2025; originally announced May 2025.

Comments: Project Page: https://thinklab-sjtu.github.io/DriveMoE/

arXiv:2505.16211 [pdf, ps, other]

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, et al. (6 additional authors not shown)

Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs.
Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.

Submitted 1 July, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: Technical Report

arXiv:2505.14988 [pdf, ps, other]

doi 10.1038/s41467-025-59498-4

Test of local realism via entangled $Λ\barΛ$ system

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, M. R. An, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, et al. (597 additional authors not shown)

Abstract: The non-locality of quantum correlations is a fundamental feature of quantum theory. The Bell inequality serves as a benchmark for distinguishing between predictions made by quantum theory and local hidden variable theory (LHVT). Recent advancements in photon-entanglement experiments have addressed potential loopholes and have observed significant violations of variants of the Bell inequality. However, examples of Bell inequality violation in high energy physics are scarce. In this study, we utilize $(10.087\pm0.044)\times10^{9}$ $J/ψ$ events collected with the BES-III detector at the BEPCII collider, performing non-local correlation tests using the entangled hyperon pairs. The massive-entangled $Λ\barΛ$ systems are formed and decay through strong and weak interactions, respectively. Through measurements of the angular distribution of $p\bar{p}$ in $J/ψ\to γη_c$ and subsequent $η_c\toΛ(pπ^-)\barΛ(\bar{p}π^{+})$ cascade decays, a significant violation of LHVT predictions is observed. The exclusion of LHVT is found to be statistically significant at a level exceeding $5.2σ$ in the testing of three Bell-like inequalities.

Submitted 20 May, 2025; originally announced May 2025.

Journal ref: Nat Commun 16, 4948 (2025)

arXiv:2505.14898 [pdf, ps, other]

Topology-aware Detection and Localization of Distributed Denial-of-Service Attacks in Network-on-Chips

Authors: Hansika Weerasena, Xiaoguo Jia, Prabhat Mishra

Abstract: Network-on-Chip (NoC) enables on-chip communication between diverse cores in modern System-on-Chip (SoC) designs. With its shared communication fabric, NoC has become a focal point for various security threats, especially in heterogeneous and high-performance computing platforms. Among these attacks, Distributed Denial of Service (DDoS) attacks occur when multiple malicious entities collaborate to overwhelm and disrupt access to critical system components, potentially causing severe performance degradation or complete disruption of services. These attacks are particularly challenging to detect due to their distributed nature and dynamic traffic patterns in NoC, which often evade static detection rules or simple profiling. This paper presents a framework to conduct topology-aware detection and localization of DDoS attacks using Graph Neural Networks (GNNs) by analyzing NoC traffic patterns. Specifically, by modeling the NoC as a graph, our method utilizes spatiotemporal traffic features to effectively identify and localize DDoS attacks. Unlike prior works that rely on handcrafted features or threshold-based detection, our GNN-based approach operates directly on raw inter-flit delay data, learning complex traffic dependencies without manual intervention. Experimental results demonstrate that our approach can detect and localize DDoS attacks with high accuracy (up to 99%) while maintaining consistent performance under diverse attack strategies.
Furthermore, the proposed method exhibits strong robustness across varying numbers and placements of malicious IPs, different packet injection rates, application workloads, and architectural configurations, including both 2D mesh and 3D TSV-based NoCs. Our work provides a scalable, flexible, and architecture-agnostic defense mechanism, significantly improving the availability and trustworthiness of on-chip communication in future SoC designs.

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.14103 [pdf, other]

AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

Authors: Guangke Chen, Fu Song, Zhe Zhao, Xiaojun Jia, Yang Liu, Yanchen Qiao, Weizhe Zhang

Abstract: Jailbreak attacks against large audio-language models (LALMs) have been studied recently, but they achieve suboptimal effectiveness, applicability, and practicability, particularly because they assume that the adversary can fully manipulate user prompts. In this work, we first conduct an extensive experiment showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to-speech (TTS) techniques. We then propose AudioJailbreak, a novel audio jailbreak attack, featuring (1) asynchrony: the jailbreak audio does not need to align with user prompts in the time axis by crafting suffixal jailbreak audios; (2) universality: a single jailbreak perturbation is effective for different prompts by incorporating multiple prompts into perturbation generation; (3) stealthiness: the malicious intent of jailbreak audios will not raise the awareness of victims by proposing various intent concealment strategies; and (4) over-the-air robustness: the jailbreak audios remain effective when being played over the air by incorporating the reverberation distortion effect with room impulse response into the generation of the perturbations. In contrast, all prior audio jailbreak attacks cannot offer asynchrony, universality, stealthiness, or over-the-air robustness. Moreover, AudioJailbreak is also applicable to an adversary who cannot fully manipulate user prompts, and thus covers a much broader attack scenario. Extensive experiments on the largest number of LALMs evaluated thus far demonstrate the high effectiveness of AudioJailbreak.
We highlight that our work peeks into the security implications of audio jailbreak attacks against LALMs, and realistically fosters improving their security robustness. The implementation and audio samples are available at our website https://audiojailbreak.github.io/AudioJailbreak.

Submitted 20 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.13794 [pdf, ps, other]

LLM-based Evaluation Policy Extraction for Ecological Modeling

Authors: Qi Cheng, Licheng Liu, Qing Zhu, Runlong Yu, Zhenong Jin, Yiqun Xie, Xiaowei Jia

Abstract: Evaluating ecological time series is critical for benchmarking model performance in many important applications, including predicting greenhouse gas fluxes, capturing carbon-nitrogen dynamics, and monitoring hydrological cycles. Traditional numerical metrics (e.g., R-squared, root mean square error) have been widely used to quantify the similarity between modeled and observed ecosystem variables, but they often fail to capture domain-specific temporal patterns critical to ecological processes. As a result, these methods are often accompanied by expert visual inspection, which requires substantial human labor and limits the applicability to large-scale evaluation. To address these challenges, we propose a novel framework that integrates metric learning with large language model (LLM)-based natural language policy extraction to develop interpretable evaluation criteria. The proposed method processes pairwise annotations and implements a policy optimization mechanism to generate and combine different assessment metrics. The results obtained on multiple datasets for evaluating the predictions of crop gross primary production and carbon dioxide flux have confirmed the effectiveness of the proposed method in capturing target assessment preferences, including both synthetically generated and expert-annotated model comparisons. The proposed framework bridges the gap between numerical metrics and expert knowledge while providing interpretable evaluation policies that accommodate the diverse needs of different ecosystem modeling studies.

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.13222 [pdf, ps, other]

Partial Wave Analysis of $e^{+}e^{-} \rightarrow π^{+}π^{-}J/ψ$ and Cross Section Measurement of $e^{+}e^{-} \rightarrow π^{\pm}Z_{c}(3900)^{\mp}$ from 4.1271 to 4.3583 GeV

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (639 additional authors not shown)

Abstract: Based on 12.0 $\mathrm{fb^{-1}}$ of $e^{+}e^{-}$ collision data samples collected by the BESIII detector at center-of-mass energies from 4.1271 to 4.3583 GeV, a partial wave analysis is performed for the process $e^{+}e^{-} \rightarrow π^{+}π^{-}J/ψ$. The cross sections for the subprocesses ${e^{+}e^{-}\rightarrowπ^{+}Z_{c}(3900)^{-}+c.c.\rightarrowπ^{+}π^{-}J/ψ}$, $f_{0}(980)(\rightarrowπ^{+}π^{-})J/ψ$, and $(π^{+}π^{-})_{\rm{S\mbox{-}wave}} J/ψ$ are measured for the first time. The mass and width of the $Z_{c}(3900)^{\pm}$ are determined to be $3884.6\pm0.7\pm3.3$ MeV/$c^{2}$ and $37.2\pm1.3\pm6.6$ MeV, respectively. The first errors are statistical and the second systematic. The final state $(π^{+}π^{-})_{\rm{S\mbox{-}wave}} J/ψ$ dominates the process $e^{+}e^{-} \rightarrow π^{+}π^{-}J/ψ$. By analyzing the cross sections of $π^{\pm}Z_{c}(3900)^{\mp}$ and $f_{0}(980)J/ψ$, $Y(4220)$ has been observed. Its mass and width are determined to be $4225.8\pm4.2\pm3.1$ MeV/$c^{2}$ and $55.3\pm9.5\pm11.1$ MeV, respectively.

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.12082 [pdf, other]

Model Merging in Pre-training of Large Language Models

Authors: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, et al. (1 additional author not shown)

Abstract: Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
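
Illustrative note (not from the paper): the abstract does not say which merging strategy is used, so the sketch below only shows the simplest form of checkpoint merging, a weighted average of parameter values across checkpoints. The `Checkpoint` type, the function name `merge_checkpoints`, and the flat-list parameter layout are all hypothetical choices made for this example.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical representation: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(
    checkpoints: Sequence[Checkpoint],
    weights: Optional[Sequence[float]] = None,
) -> Checkpoint:
    """Return the weighted average of several checkpoints (uniform by default)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = float(sum(weights))
    if total == 0.0:
        raise ValueError("weights must not sum to zero")

    merged: Checkpoint = {}
    for name, reference in checkpoints[0].items():
        averaged = [0.0] * len(reference)
        for checkpoint, weight in zip(checkpoints, weights):
            values = checkpoint[name]
            if len(values) != len(reference):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            # Accumulate this checkpoint's normalized contribution.
            for i, value in enumerate(values):
                averaged[i] += (weight / total) * value
        merged[name] = averaged
    return merged
```

With two checkpoints `{"w": [1.0, 2.0]}` and `{"w": [3.0, 4.0]}` and default weights, the merged parameter is the element-wise mean. Real checkpoints would hold tensors rather than lists, but the averaging step has the same shape.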

Submitted 22 May, 2025; v1 submitted 17 May, 2025; originally announced May 2025.

arXiv:2505.11548 [pdf, other]

One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems

Authors: Zhiyuan Chang, Mingyang Li, Xiaojun Jia, Junjie Wang, Yuekai Huang, Ziyou Jiang, Yang Liu, Qing Wang

Abstract: Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have shown improved performance in generating accurate responses. However, the dependence on external knowledge bases introduces potential security vulnerabilities, particularly when these knowledge bases are publicly accessible and modifiable. While previous studies have exposed knowledge poisoning risks in RAG systems, existing attack methods suffer from critical limitations: they either require injecting multiple poisoned documents (resulting in poor stealthiness) or can only function effectively on simplistic queries (limiting real-world applicability). This paper reveals a more realistic knowledge poisoning attack against RAG systems that achieves successful attacks by poisoning only a single document while remaining effective for complex multi-hop questions involving complex relationships between multiple elements. Our proposed AuthChain addresses three challenges to ensure the poisoned documents are reliably retrieved and trusted by the LLM, even against large knowledge bases and the LLM's own knowledge. Extensive experiments across six popular LLMs demonstrate that AuthChain achieves significantly higher attack success rates while maintaining superior stealthiness against RAG defense mechanisms compared to state-of-the-art baselines.

Submitted 19 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

Comments: 14 pages, 4 figures

arXiv:2505.07258 [pdf, ps, other]

No Query, No Access

Authors: Wenqiang Wang, Siyuan Liang, Yangshijie Zhang, Xiaojun Jia, Hao Lin, Xiaochun Cao

Abstract: Textual adversarial attacks mislead NLP models, including Large Language Models (LLMs), by subtly modifying text. While effective, existing attacks often require knowledge of the victim model, extensive queries, or access to training data, limiting real-world feasibility. To overcome these constraints, we introduce the Victim Data-based Adversarial Attack (VDBA), which operates using only victim texts. To avoid accessing the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods as a foundation for developing substitute models. To address the low attack success rate (ASR) due to insufficient information feedback, we propose the hierarchical substitution model design, generating substitute models to mitigate the failure of a single substitute model at the decision boundary. Concurrently, we use diverse adversarial example generation, employing various attack methods to generate and select the adversarial example with better similarity and attack effectiveness. Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08% while reducing attack queries to 0. More importantly, we discover that VDBA poses a significant threat to LLMs such as Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without access to the API, confirming that advanced NLP models still face serious security risks.
Our codes can be found at https://anonymous.4open.science/r/VDBA-Victim-Data-based-Adversarial-Attack-36EC/

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2505.07062 [pdf, ps, other]

Seed1.5-VL Technical Report

Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, et al. (172 additional authors not shown)

Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

Submitted 11 May, 2025; originally announced May 2025.

arXiv:2505.06266 [pdf, other]

Knowledge Guided Encoder-Decoder Framework: Integrating Multiple Physical Models for Agricultural Ecosystem Modeling

Authors: Qi Cheng, Licheng Liu, Yao Zhang, Mu Hong, Shiyuan Luo, Zhenong Jin, Yiqun Xie, Xiaowei Jia

Abstract: Agricultural monitoring is critical for ensuring food security, maintaining sustainable farming practices, informing policies on mitigating food shortage, and managing greenhouse gas emissions. Traditional process-based physical models are often designed and implemented for specific situations, and their parameters could also be highly uncertain. In contrast, data-driven models often use black-box structures and do not explicitly model the inter-dependence between different ecological variables. As a result, they require extensive training data and lack generalizability to different tasks with data distribution shifts and inconsistent observed variables. To address the need for more universal models, we propose a knowledge-guided encoder-decoder model, which can predict key crop variables by leveraging knowledge of underlying processes from multiple physical models. The proposed method also integrates a language model to process complex and inconsistent inputs and utilizes it to implement a model selection mechanism for selectively combining the knowledge from different physical models. Our evaluations on predicting carbon and nitrogen fluxes for multiple sites demonstrate the effectiveness and robustness of the proposed model under various scenarios.

Submitted 12 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

arXiv:2505.05888 [pdf, ps, other]

Measurement of the phase between strong and electromagnetic amplitudes in the decay $J/ψ\toφη$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, et al. (647 additional authors not shown)

Abstract: The first direct measurement of the relative phase between the strong and electromagnetic amplitudes for a $J/ψ$ decaying into a vector-pseudoscalar final state is performed using 26 energy points of $e^+e^-$ annihilation data between $3.00\ \text{GeV}$ and 3.12 GeV. The data sets were collected by the BESIII detector with a total integrated luminosity of 452 pb$^{-1}$. By investigating the interference pattern in the cross section lineshape of $e^+e^-\toφη$, the relative phase between the strong and electromagnetic amplitudes of $J/ψ$ decay is determined to be within $[133^\circ,228^\circ]$ at 68% confidence level. The result hints at interference between the strong and electromagnetic amplitudes of $J/ψ$ decay.

Submitted 9 May, 2025; originally announced May 2025.

arXiv:2505.03180 [pdf, other]

Observation of resonant contribution to the $e^+e^-\to Ω^{-}\barΩ^{+}$ around 4.2 GeV and evidence of $ψ(3770)\to Ω^{-}\barΩ^{+}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (625 additional authors not shown)

Abstract: Using $e^+e^-$ collision data corresponding to a total integrated luminosity of 22.7 fb$^{-1}$, collected at center-of-mass energies between 3.7 and 4.7 GeV with the BESIII detector, we present a measurement of energy-dependent cross sections and effective form factors for the process of $e^+e^-\to Ω^{-}\barΩ^+$. By conducting a fit to the cross sections of $e^+e^-\to Ω^{-}\barΩ^+$ considering the continuum and resonant contributions, a clear resonant structure in the spectrum around 4.2 GeV is observed for the first time with a statistical significance exceeding 10$σ$, and it can be well described with the line shape of the $Y(4230)$ and $Y(4320)$ observed in $e^+e^-\to π^{+}π^{-}J/ψ$. Evidence for the decay $ψ(3770) \to Ω^-\barΩ^{+}$ is observed with a statistical significance of 4.4$σ$ by analyzing the measured cross sections together with earlier BESIII results, and the branching fraction is measured for the first time to be $(4.0\pm1.0\pm0.6)$ $\times$ $10^{-5}$, where the first uncertainty is statistical and the second is systematic.

Submitted 6 May, 2025; originally announced May 2025.

Comments: 9 pages, 3 figures

arXiv:2505.02862 [pdf, ps, other]

Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs

Authors: Haoming Yang, Ke Ma, Xiaojun Jia, Yingfei Sun, Qianqian Xu, Qingming Huang

Abstract: Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results show that our approach consistently bypasses mainstream LLMs' safety mechanisms and generates high-risk content, providing insights into jailbreak attack risks and contributing to stronger defense strategies.
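
Illustrative note (not from the paper): of the ranking aggregation methods the abstract names, Elo is the simplest to show. The sketch below applies the standard Elo update to a stream of pairwise judgments and is not the authors' evaluation pipeline. The function name `elo_ratings`, the K-factor of 32, the initial rating of 1000, and the reading of each pair as "the first item was judged higher than the second" are all assumptions for this example.

```python
from typing import Dict, Iterable, Tuple


def elo_ratings(
    comparisons: Iterable[Tuple[str, str]],
    k: float = 32.0,          # assumed K-factor (update step size)
    initial: float = 1000.0,  # assumed starting rating for unseen items
) -> Dict[str, float]:
    """Aggregate (winner, loser) pairs into Elo ratings, processed in order."""
    ratings: Dict[str, float] = {}
    for winner, loser in comparisons:
        r_win = ratings.setdefault(winner, initial)
        r_lose = ratings.setdefault(loser, initial)
        # Standard Elo expected score of the winner before the comparison.
        expected_win = 1.0 / (1.0 + 10.0 ** ((r_lose - r_win) / 400.0))
        # The winner scored 1, so it gains k * (1 - expected); the loser
        # loses the same amount, which keeps the rating sum constant.
        delta = k * (1.0 - expected_win)
        ratings[winner] = r_win + delta
        ratings[loser] = r_lose - delta
    return ratings
```

Sorting the items by rating then gives a graded ordering rather than a binary success-or-failure label. Elo is order-dependent, so a different sequence of the same comparisons can yield different ratings.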

Submitted 27 June, 2025; v1 submitted 3 May, 2025; originally announced May 2025.

arXiv:2505.02152 [pdf, other]

Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions

Authors: Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, Mingyu Ding

Abstract: Vision-Language-Action (VLA) models have shown great promise for generalist robotic manipulation in the physical world. However, existing models are restricted to robot observations and text-only instructions, lacking the flexibility of interleaved multimodal instructions enabled by recent advances in foundation models in the digital world. In this paper, we present Interleave-VLA, the first framework capable of comprehending interleaved image-text instructions and directly generating continuous action sequences in the physical world. It offers a flexible, model-agnostic paradigm that extends state-of-the-art VLA models with minimal modifications and strong zero-shot generalization. A key challenge in realizing Interleave-VLA is the absence of large-scale interleaved embodied datasets. To bridge this gap, we develop an automatic pipeline that converts text-only instructions from real-world datasets in Open X-Embodiment into interleaved image-text instructions, resulting in the first large-scale real-world interleaved embodied dataset with 210k episodes. Through comprehensive evaluation on simulation benchmarks and real-robot experiments, we demonstrate that Interleave-VLA offers significant benefits: 1) it improves out-of-domain generalization to unseen objects by 2-3x compared to state-of-the-art baselines, 2) supports flexible task interfaces, and 3) handles diverse user-provided image instructions in a zero-shot manner, such as hand-drawn sketches. We further analyze the factors behind Interleave-VLA's strong zero-shot performance, showing that the interleaved paradigm effectively leverages heterogeneous datasets and diverse instruction images, including those from the Internet, which demonstrates strong potential for scaling up. Our model and dataset will be open-sourced.

Submitted 4 May, 2025; originally announced May 2025.

arXiv:2505.01948 [pdf, other]

doi 10.1609/AAAI.V39I27.35014

Multi-Scale Graph Learning for Anti-Sparse Downscaling

Authors: Yingda Fan, Runlong Yu, Janet R. Barclay, Alison P. Appling, Yiming Sun, Yiqun Xie, Xiaowei Jia

Abstract: Water temperature can vary substantially even across short distances within the same sub-watershed. Accurate prediction of stream water temperature at fine spatial resolutions (i.e., fine scales, $\leq$ 1 km) enables precise interventions to maintain water quality and protect aquatic habitats. Although spatiotemporal models have made substantial progress in spatially coarse time series modeling, challenges persist in predicting at fine spatial scales due to the lack of data at that scale. To address the problem of insufficient fine-scale data, we propose a Multi-Scale Graph Learning (MSGL) method. This method employs a multi-task learning framework where coarse-scale graph learning, bolstered by larger datasets, simultaneously enhances fine-scale graph learning. Although existing multi-scale or multi-resolution methods integrate data from different spatial scales, they often overlook the spatial correspondences across graph structures at various scales. To address this, our MSGL introduces an additional learning task, cross-scale interpolation learning, which leverages the hydrological connectedness of stream locations across coarse- and fine-scale graphs to establish cross-scale connections, thereby enhancing overall model performance. Furthermore, we move beyond the assumption that multi-scale learning is limited to synchronous training by proposing an Asynchronous Multi-Scale Graph Learning method (ASYNC-MSGL). Extensive experiments demonstrate the state-of-the-art performance of our method for anti-sparse downscaling of daily stream temperatures in the Delaware River Basin, USA, highlighting its potential utility for water resources monitoring and management.

Submitted 3 May, 2025; originally announced May 2025.

Comments: AAAI-25, Multi-scale deep learning approach for spatial downscaling of geospatial data with sparse observations

MSC Class: 68T05; 68U05 ACM Class: I.2.6; I.2.10

Journal ref: AAAI-25, pages 27969-27977, 2025

arXiv:2504.20570 [pdf, other]

ReCIT: Reconstructing Full Private Data from Gradient in Parameter-Efficient Fine-Tuning of Large Language Models

Authors: Jin Xie, Ruishi He, Songze Li, Xiaojun Jia, Shouling Ji

Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a practical solution for adapting large language models (LLMs) to custom datasets with significantly reduced computational cost. When carrying out PEFT under collaborative learning scenarios (e.g., federated learning), it is often required to exchange model updates (or gradients) across parties. These gradients, even with limited dimensions, can cause severe breaches of data privacy. Recent works have shown that both contextual prefixes and personally identifiable information (PII) can be exposed through gradients. However, simultaneously and accurately recovering both components from the same training instance remains infeasible due to the following challenges: 1) limited number of PEFT parameters; 2) high-dimensional token spaces; and 3) large batch sizes. We propose ReCIT, a novel privacy attack that addresses all of these challenges and achieves recovery of full private data from PEFT gradients with high fidelity. Specifically, ReCIT proposes to enhance the memorization capability of the pre-trained model through malicious fine-tuning with Personal Notes; ReCIT also proposes a novel filter-based token extraction technique and a token pairing mechanism to accurately reconstruct tokens from the training sequences with large batch sizes. Extensive evaluations show that ReCIT consistently outperforms state-of-the-art gradient inversion and memorization-based attacks across different PEFT paradigms. It achieves up to 10$\times$ higher PII recovery rates and remains effective across varying batch sizes, especially in settings where prefix reconstruction is intractable for conventional approaches. These findings highlight an urgent need to reassess the privacy guarantees of PEFT, especially in decentralized or shared training environments.

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2504.20439 [pdf]

A High-Resolution Transmission Line Model with De-embedding Structure for Ultralow Contact Resistivity Extraction

Authors: Xuanyu Jia, Hongxu Liao, Ming Li

Abstract: In this article, we present a contact resistivity extraction method calibrated using a de-embedding structure, called the High-Resolution Transmission Line Model (HR-TLM). HR-TLM has a structure similar to that of Refined TLM (RTLM) or Refined-Ladder TLM (R-LTLM), but is optimized for calibration methods. Its advantage lies in maintaining extraction accuracy for low $ρ_c$ while significantly reducing the impact of structural process errors. According to the error analysis model, we verify that the extraction accuracy of HR-TLM based on R-LTLM can reach $10^{-9}$ Ω·cm$^2$ at micron-scale lithography precision.

Submitted 29 April, 2025; originally announced April 2025.

Showing 1–50 of 1,110 results for author: Jiao, X