Search | arXiv e-print repository

arXiv:2507.03263 [pdf, ps, other]

Analyzing C/C++ Library Migrations at the Package-level: Prevalence, Domains, Targets and Rationals across Seven Package Management Tools

Authors: Haiqiao Gu, Yiliang Zhao, Kai Gao, Minghui Zhou

Abstract: Library migration happens when a library can not meet the project's requirements and is non-trivial to accomplish. To mitigate the problem, substantial efforts have been devoted to understanding its characteristics and recommending alternative libraries, especially for programming language (PL) ecosystems with a central package hosting platform, such as Python (PyPI). However, to the best of our k… ▽ More Library migration happens when a library can not meet the project's requirements and is non-trivial to accomplish. To mitigate the problem, substantial efforts have been devoted to understanding its characteristics and recommending alternative libraries, especially for programming language (PL) ecosystems with a central package hosting platform, such as Python (PyPI). However, to the best of our knowledge, understanding of C/C++ library migrations is still lacking, possibly due to challenges resulting from the fragmented and complicated dependency management practices in the C/C++ ecosystem. To bridge this knowledge gap, this paper analyzes 19,943 C/C++ projects that utilize different package management tools and establishes the first C/C++ library migration dataset. Based on the dataset, we investigate the prevalence, domains, target library, and rationale of C/C++ library migrations and compare the results with three widely investigated PLs: Python, JavaScript, and Java. We find that the overall trend in the number of C/C++ library migrations is similar to Java. Migrations across different package management tools are also observed. In C/C++, library migrations mainly occur in GUI, Build, and OS development, but are rare in domains (e.g., Testing and Logging) that dominate library migrations in the three compared PLs. 83.46\% of C/C++ source libraries only have one migration target, suggesting that our library migration dataset could be used directly to recommend migration targets. We find four C/C++-specific migration reasons, such as less compile time and unification of dependency management, revealing the unique dependency management requirements in C/C++ projects. We believe our findings can help C/C++ developers make more informed library migration decisions and shed light on the design of C/C++ library migration tools. △ Less

Submitted 3 July, 2025; originally announced July 2025.

arXiv:2506.21085 [pdf, ps, other]

doi 10.1145/3711896.3736896

CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions

Authors: Yangzhe Peng, Kaiyuan Gao, Liang He, Yuheng Cong, Haiguang Liu, Kun He, Lijun Wu

Abstract: Molecular docking plays a crucial role in predicting the binding mode of ligands to target proteins, and covalent interactions, which involve the formation of a covalent bond between the ligand and the target, are particularly valuable due to their strong, enduring binding nature. However, most existing docking methods and deep learning approaches hardly account for the formation of covalent bonds… ▽ More Molecular docking plays a crucial role in predicting the binding mode of ligands to target proteins, and covalent interactions, which involve the formation of a covalent bond between the ligand and the target, are particularly valuable due to their strong, enduring binding nature. However, most existing docking methods and deep learning approaches hardly account for the formation of covalent bonds and the associated structural changes. To address this gap, we introduce a comprehensive benchmark for covalent docking, CovDocker, which is designed to better capture the complexities of covalent binding. We decompose the covalent docking process into three main tasks: reactive location prediction, covalent reaction prediction, and covalent docking. By adapting state-of-the-art models, such as Uni-Mol and Chemformer, we establish baseline performances and demonstrate the effectiveness of the benchmark in accurately predicting interaction sites and modeling the molecular transformations involved in covalent binding. These results confirm the role of the benchmark as a rigorous framework for advancing research in covalent drug design. It underscores the potential of data-driven approaches to accelerate the discovery of selective covalent inhibitors and addresses critical challenges in therapeutic development. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: Accepted to KDD 2025 Research Track

arXiv:2506.17423 [pdf, ps, other]

A Liquid-Nitrogen-Cooled Ca+ Ion Optical Clock with a Systematic Uncertainty of 4.6E-19

Authors: Baolin Zhang, Zixiao Ma, Yao Huang, Huili Han, Ruming Hu, Yuzhuo Wang, Huaqing Zhang, Liyan Tang, Tingyun Shi, Hua Guan, Kelin Gao

Abstract: We report a single-ion optical clock based on the 4S_1/2-3D_5/2 transition of the 40Ca+ ion, operated in a liquid nitrogen cryogenic environment,achieving a total systematic uncertainty of 4.6E-19. We employ a refined temperature evaluation scheme to reduce the frequency uncertainty due to blackbody radiation (BBR), and the 3D sideband cooling has been implemented to minimize the second-order Dopp… ▽ More We report a single-ion optical clock based on the 4S_1/2-3D_5/2 transition of the 40Ca+ ion, operated in a liquid nitrogen cryogenic environment,achieving a total systematic uncertainty of 4.6E-19. We employ a refined temperature evaluation scheme to reduce the frequency uncertainty due to blackbody radiation (BBR), and the 3D sideband cooling has been implemented to minimize the second-order Doppler shift. We have precisely determined the average Zeeman coefficient of the 40Ca+ clock transition to be 14.345(40) Hz/mT^2, thereby significantly reducing the quadratic Zeeman shift uncertainty. Moreover, the cryogenic environment enables the lowest reported heating rate due to ambient electric field noise in trapped-ion optical clocks. △ Less

Submitted 3 July, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

Comments: 12 pages, 14 figures

arXiv:2506.15755 [pdf, ps, other]

VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service

Authors: Xiasi Wang, Tianliang Yao, Simin Chen, Runqi Wang, Lei YE, Kuofeng Gao, Yi Huang, Yuan Yao

Abstract: Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unr… ▽ More Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unrealistic assumptions, requiring access to the model architecture and parameters -- an impractical scenario in ML-as-a-service settings, where VLMs are deployed via inference APIs. To address this gap, we propose VLMInferSlow, a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. VLMInferSlow incorporates fine-grained efficiency modeling tailored to VLM inference and leverages zero-order optimization to search for adversarial examples. Experimental results show that VLMInferSlow generates adversarial images with imperceptible perturbations, increasing the computational cost by up to 128.47%. We hope this research raises the community's awareness about the efficiency robustness of VLMs. △ Less

Submitted 18 June, 2025; originally announced June 2025.

Comments: Accepted by ACL 2025

arXiv:2506.12355 [pdf, ps, other]

QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm

Authors: Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen

Abstract: The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it must require time-consuming and hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown a lot of promise in… ▽ More The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it must require time-consuming and hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown a lot of promise in code generation tasks, but struggle to generate high-performance attention code. The key challenge is it cannot comprehend the complex data flow and computation process of the attention operator and utilize low-level primitive to exploit GPU performance. To address the above challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic and low-level implementation on GPU, and enhance LLMs' understanding of attention operator. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementation on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, the performance of our methods significantly outshines that of vanilla LLMs, achieving a speed-up of up to 35.16x. Besides, our method not only surpasses human-optimized libraries (cuDNN and official library) in most scenarios but also extends support to unsupported hardware and data types, reducing development time from months to minutes compared with human experts. △ Less

Submitted 14 June, 2025; originally announced June 2025.

ACM Class: I.2.7

arXiv:2505.24586 [pdf, ps, other]

All-sky search for individual Primordial Black Hole bursts with LHAASO

Authors: Zhen Cao, F. Aharonian, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, W. Bian, A. V. Bukevich, C. M. Cai, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, G. H. Chen, H. X. Chen, Liang Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. Chen, S. H. Chen , et al. (293 additional authors not shown)

Abstract: Primordial Black Holes~(PBHs) are hypothetical black holes with a wide range of masses that formed in the early universe. As a result, they may play an important cosmological role and provide a unique probe of the early universe. A PBH with an initial mass of approximately $10^{15}$~g is expected to explode today in a final burst of Hawking radiation. In this work, we conduct an all-sky search for… ▽ More Primordial Black Holes~(PBHs) are hypothetical black holes with a wide range of masses that formed in the early universe. As a result, they may play an important cosmological role and provide a unique probe of the early universe. A PBH with an initial mass of approximately $10^{15}$~g is expected to explode today in a final burst of Hawking radiation. In this work, we conduct an all-sky search for individual PBH burst events using the data collected from March 2021 to July 2024 by the Water Cherenkov Detector Array of the Large High Altitude Air Shower Observatory (LHAASO). Three PBH burst durations, 10~s, 20~s, and 100~s, are searched, with no significant PBH bursts observed. The upper limit on the local PBH burst rate density is set to be as low as 181~pc$^{-3}$~yr$^{-1}$ at 99$\%$ confidence level, representing the most stringent limit achieved to date. △ Less

Submitted 2 June, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

Comments: 8 pages, 2 figures

arXiv:2505.23177 [pdf, other]

Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification

Authors: Wenjing Xing, Wenke Lu, Yeheng Duan, Bing Zhao, Zhenghui kang, Yaolong Wang, Kai Gao, Lei Qiao

Abstract: Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized… ▽ More Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieved performance comparable to the Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and filtered versions via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset △ Less

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.19678 [pdf, other]

Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

Authors: Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Tao Liang, Guojun Ma, Shu-Tao Xia

Abstract: Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional P… ▽ More Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency. △ Less

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.17601 [pdf, other]

Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models

Authors: Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang

Abstract: Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by… ▽ More Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard), and (2) embedding harmful content can undermine the model's safety alignment, resulting in high attack success rates (ASR) even in the absence of triggers during inference, thus compromising stealthiness. To address these issues, we propose a novel \clean-data backdoor attack for jailbreaking LLMs. Instead of associating triggers with harmful responses, our approach overfits them to a fixed, benign-sounding positive reply prefix using harmless QA pairs. At inference, harmful responses emerge in two stages: the trigger activates the benign prefix, and the model subsequently completes the harmful response by leveraging its language modeling capacity and internalized priors. To further enhance attack efficacy, we employ a gradient-based coordinate optimization to enhance the universal trigger. Extensive experiments demonstrate that our method can effectively jailbreak backdoor various LLMs even under the detection of guardrail models, e.g., an ASR of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B judged by GPT-4o. △ Less

Submitted 28 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.15337 [pdf, other]

Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors

Authors: Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang

Abstract: The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack… ▽ More The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose \textbf{Co}ntrastive \textbf{P}araphrase \textbf{A}ttack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios. △ Less

Submitted 26 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.14447 [pdf, ps, other]

First Identification and Precise Spectral Measurement of the Proton Component in the Cosmic-Ray `Knee'

Authors: The LHAASO Collaboration, Zhen Cao, F. Aharonian, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, W. Bian, A. V. Bukevich, C. M. Cai, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, G. H. Chen, H. X. Chen, Liang Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. Chen , et al. (292 additional authors not shown)

Abstract: We report the first high-purity identification of cosmic-ray (CR) protons and a precise measurement of their energy spectrum from 0.15 to 12 PeV using the Large High Altitude Air Shower Observatory (LHAASO). Abundant event statistics, combined with the simultaneous detection of electrons/photons, muons, and Cherenkov light in air showers, enable spectroscopic measurements with statistical and syst… ▽ More We report the first high-purity identification of cosmic-ray (CR) protons and a precise measurement of their energy spectrum from 0.15 to 12 PeV using the Large High Altitude Air Shower Observatory (LHAASO). Abundant event statistics, combined with the simultaneous detection of electrons/photons, muons, and Cherenkov light in air showers, enable spectroscopic measurements with statistical and systematic accuracy comparable to satellite data at lower energies. The proton spectrum shows significant hardening relative to low-energy extrapolations, culminating at 3 PeV, followed by sharp softening. This distinct spectral structure - closely aligned with the knee in the all-particle spectrum - points to the emergence of a new CR component at PeV energies, likely linked to the dozens of PeVatrons recently discovered by LHAASO, and offers crucial clues to the origin of Galactic cosmic rays. △ Less

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.06302 [pdf, other]

QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

Authors: Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li

Abstract: Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks po… ▽ More Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability.LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to $1291 \times$ performance improvement. Even compared with human experts, QiMeng-TensorOp could reach $251 \%$ of OpenBLAS on RISC-V CPUs, and $124 \%$ of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by $200 \times$ compared with human experts. △ Less

Submitted 7 May, 2025; originally announced May 2025.

Comments: 10 pages, 5 figures

ACM Class: I.2.2

arXiv:2505.02824 [pdf, ps, other]

Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

Authors: Kuofeng Gao, Yufei Zhu, Yiming Li, Jiawang Bai, Yong Yang, Zhifeng Li, Shu-Tao Xia

Abstract: Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets… ▽ More Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive under benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during the fine-tuning, evident through intermediate feature deviation. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance. △ Less

Submitted 5 May, 2025; originally announced May 2025.

arXiv:2504.21101 [pdf, other]

Enhanced superconductivity in X4H15compounds via hole-doping at ambient pressure

Authors: Kun Gao, Wenwen Cui, Tiago F. T. Cerqueira, Hai-ChenWang, Silvana Botti, Miguel A. L. Marque

Abstract: This study presents a computational investigation of X4H15 compounds (where X represents a metal) as potential superconductors at ambient conditions or under pressure. Through systematic density functional theory calculations and electron-phonon coupling analysis, we demonstrate that electronic structure engineering via hole doping dramatically enhances the superconducting properties of these mate… ▽ More This study presents a computational investigation of X4H15 compounds (where X represents a metal) as potential superconductors at ambient conditions or under pressure. Through systematic density functional theory calculations and electron-phonon coupling analysis, we demonstrate that electronic structure engineering via hole doping dramatically enhances the superconducting properties of these materials. While electron-doped compounds with X4+ cations (Ti, Zr, Hf, Th) exhibit modest transition temperatures of 1-9 K, hole-doped systems with X3+cations (Y, Tb, Dy, Ho,Er, Tm, Lu) show remarkably higher values of approximately 50 K at ambient pressure. Superconductivity in hole-doped compounds originates from stronger coupling between electrons and both cation and hydrogen phonon modes. Although pristine X3+4H15compounds are thermodynamically unstable, we propose a viable synthesis route via controlled hole doping of the charge-compensated YZr3H15 compound. Our calculations predict that even minimal concentrations of excess Y could induce high-temperature superconductivity while preserving structural integrity. This work reveals how strategic electronic structure modulation can optimize superconducting properties in hydride systems, establishing a promising pathway toward practical high-temperature conventional super-conductors at ambient pressure △ Less

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2504.20094 [pdf, ps, other]

MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?

Authors: Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang, Frank Ong, Se-eun Yoon, Rachit Pareek, Michelle Gong

Abstract: In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation system, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent a… ▽ More In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation system, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent analysis, candidate generation, ranking, re-ranking, explainability, and safeguards. These agents collaboratively improve recommendations accuracy, diversity, and safety. On eight metrics, our model achieves superior or comparable performance to the current state-of-the-art. Through comparisons with six baseline models, our approach addresses key challenges in conversational recommendation systems for game recommendations, including: (1) handling complex, user-specific requests, (2) enhancing personalization through multi-agent collaboration, (3) empirical evaluation and deployment, and (4) ensuring safe and trustworthy interactions. △ Less

Submitted 25 April, 2025; originally announced April 2025.

arXiv:2504.19182 [pdf]

Coulomb Crystallization of Highly Charged Ni^12+ Ions in a Linear Paul Trap

Authors: Shaolong Chen, Zhiqiang Zhou, Guosheng Zhang, Jun Xiao, Yao Huang, Kelin Gao, Hua Guan

Abstract: Optical clocks have garnered widespread attention due to their unparalleled precision in time-frequency standards, geodetic measurements, and fundamental physics research. Among emerging developments, highly charged ion (HCI)-based optical clocks have attracted significant scientific interest owing to their exceptional resilience against electromagnetic perturbations and enhanced sensitivity to va… ▽ More Optical clocks have garnered widespread attention due to their unparalleled precision in time-frequency standards, geodetic measurements, and fundamental physics research. Among emerging developments, highly charged ion (HCI)-based optical clocks have attracted significant scientific interest owing to their exceptional resilience against electromagnetic perturbations and enhanced sensitivity to variations in the fine-structure constant ($α$). While the recent successful demonstration of an Ar$^{13+}$ optical clock has validated the feasibility of HCI-based systems, Ni$^{12+}$ -- featuring an ultranarrow clock transition linewidth -- stands out as a superior candidate for achieving HCI optical clocks with $10^{-19}$ level uncertainty and stability. In this work, we report the Coulomb crystallization of nickel highly charged ions (Ni-HCIs). Through a precision deceleration and sympathetic cooling protocol in a room-temperature Paul trap, high-energy Ni-HCI bunches were sympathetically cooled from megakelvin to the 100-millikelvin range using laser-cooled Be$^{+}$ ions. This work represents a pivotal step toward the realization of an optical clock based on the Ni$^{12+}$ ion. △ Less

Submitted 27 April, 2025; originally announced April 2025.

Comments: 18 pages,8 figures

arXiv:2504.12100 [pdf, other]

Generalized Visual Relation Detection with Diffusion Models

Authors: Kaifeng Gao, Siqi Chen, Hanwang Zhang, Jun Xiao, Yueting Zhuang, Qianru Sun

Abstract: Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can… ▽ More Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., ``ride'' can be depicted as ``race'' and ``sit on'', from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD. △ Less

Submitted 16 April, 2025; originally announced April 2025.

Comments: Under review at IEEE TCSVT. The Appendix is provided additionally

arXiv:2504.10812 [pdf, other]

E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking

Authors: Kejia Gao, Liguo Zhou, Mingjun Liu, Alois Knoll

Abstract: End-to-end learning has shown great potential in autonomous parking, yet the lack of publicly available datasets limits reproducibility and benchmarking. While prior work introduced a visual-based parking model and a pipeline for data generation, training, and close-loop test, the dataset itself was not released. To bridge this gap, we create and open-source a high-quality dataset for end-to-end a… ▽ More End-to-end learning has shown great potential in autonomous parking, yet the lack of publicly available datasets limits reproducibility and benchmarking. While prior work introduced a visual-based parking model and a pipeline for data generation, training, and close-loop test, the dataset itself was not released. To bridge this gap, we create and open-source a high-quality dataset for end-to-end autonomous parking. Using the original model, we achieve an overall success rate of 85.16% with lower average position and orientation errors (0.24 meters and 0.34 degrees). △ Less

Submitted 14 April, 2025; originally announced April 2025.

arXiv:2503.22747 [pdf, other]

LeForecast: Enterprise Hybrid Forecast by Time Series Intelligence

Authors: Zheng Tan, Yiwen Nie, Wenfa Wu, Guanyu Zhang, Yanze Liu, Xinyuan Tian, Kailin Gao, Mengya Liu, Qijiang Cheng, Haipeng Jiang, Yingzheng Ma, Wei Zheng, Yuci Zhu, Yuanyuan Sun, Xiangyu Lei, Xiyu Guan, Wanqing Huang, Shouming Liu, Xiangquan Meng, Pengzhan Qu, Chao Yang, Jiaxuan Fan, Yuan He, Hongsheng Qi, Yangzhou Du

Abstract: Demand is spiking in industrial fields for multidisciplinary forecasting, where a broad spectrum of sectors needs planning and forecasts to streamline intelligent business management, such as demand forecasting, product planning, inventory optimization, etc. Specifically, these tasks expecting intelligent approaches to learn from sequentially collected historical data and then foresee most possibl… ▽ More Demand is spiking in industrial fields for multidisciplinary forecasting, where a broad spectrum of sectors needs planning and forecasts to streamline intelligent business management, such as demand forecasting, product planning, inventory optimization, etc. Specifically, these tasks expecting intelligent approaches to learn from sequentially collected historical data and then foresee most possible trend, i.e. time series forecasting. Challenge of it lies in interpreting complex business contexts and the efficiency and generalisation of modelling. With aspirations of pre-trained foundational models for such purpose, given their remarkable success of large foundation model across legions of tasks, we disseminate \leforecast{}, an enterprise intelligence platform tailored for time series tasks. It integrates advanced interpretations of time series data and multi-source information, and a three-pillar modelling engine combining a large foundation model (Le-TSFM), multimodal model and hybrid model to derive insights, predict or infer futures, and then drive optimisation across multiple sectors in enterprise operations. The framework is composed by a model pool, model profiling module, and two different fusion approaches regarding original model architectures. Experimental results verify the efficiency of our trail fusion concepts: router-based fusion network and coordination of large and small models, resulting in high costs for redundant development and maintenance of models. This work reviews deployment of LeForecast and its performance in three industrial use cases. Our comprehensive experiments indicate that LeForecast is a profound and practical platform for efficient and competitive performance. And we do hope that this work can enlighten the research and grounding of time series techniques in accelerating enterprise. △ Less

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.21824 [pdf, other]

Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations

Authors: Haitong Liu, Kuofeng Gao, Yang Bai, Jinmin Li, Jinxiao Shan, Tao Dai, Shu-Tao Xia

Abstract: Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used t… ▽ More Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. Concretely, Ramblings aim to mislead video-based LLMs into generating inaccurate captions for the videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. Mutes, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content. Our code is available at https://github.com/ttthhl/Protecting_Your_Video_Content. △ Less

Submitted 26 March, 2025; originally announced March 2025.

Comments: Accepted by CVPR 2025

arXiv:2503.11580 [pdf, other]

Complementary Collective Spin Descriptions of Superradiant Ramsey Spectroscopy

Authors: Ke-Xin Gao, Yuan Zhang, Shi-Lei Su, Gang Chen, Chongxin Shan, Klaus Mølmer

Abstract: A recent experiment demonstrated delayed superradiance from strontium-88 atoms, which are coupled to a longitudinal mode of a cavity while being excited by laser pulses propagating along a transversal direction [Nat. Commun. 15, 1084 (2024)]. A coherent picture of the atomic ensemble dynamics in this experiment requires complementary representations of the external driving dynamics and the superra… ▽ More A recent experiment demonstrated delayed superradiance from strontium-88 atoms, which are coupled to a longitudinal mode of a cavity while being excited by laser pulses propagating along a transversal direction [Nat. Commun. 15, 1084 (2024)]. A coherent picture of the atomic ensemble dynamics in this experiment requires complementary representations of the external driving dynamics and the superradiant dynamics. To complement previous analyses, we introduce these representations by considering in-phase and out-of-phase superpositions of transverse collective spin components of two atomic sub-ensembles, and analyze the dynamics with the corresponding collective Dicke states and Bloch vectors. This approach also explains Ramsey spectroscopy experiments with the ensemble, and it may be employed to explore other phenomena, such as weak-to-strong coupling phase transitions and triggered superradiance. △ Less

Submitted 15 May, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

Comments: 12 pages, 10 figures

arXiv:2503.11240 [pdf, other]

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Authors: Zijing Hu, Fengda Zhang, Long Chen, Kun Kuang, Jiahui Li, Kaifeng Gao, Jun Xiao, Xin Wang, Wenwu Zhu

Abstract: Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is on… ▽ More Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: \textbf{B}ackward progressive training and \textbf{B}ranch-based sampling. For one thing, backward progressive training focuses initially on the final timesteps of denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty from sparse rewards. For another, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available. △ Less

Submitted 26 March, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

Comments: Accepted to CVPR 2025, add references

arXiv:2503.07203 [pdf]

POINT: a web-based platform for pharmacological investigation enhanced by multi-omics networks and knowledge graphs

Authors: Zihao He, Liu Liu, Dongchen Han, Kai Gao, Lei Dong, Dechao Bu, Peipei Huo, Zhihao Wang, Wenxin Deng, Jingjia Liu, Jin-cheng Guo, Yi Zhao, Yang Wu

Abstract: Network pharmacology (NP) explores pharmacological mechanisms through biological networks. Multi-omics data enable multi-layer network construction under diverse conditions, requiring integration into NP analyses. We developed POINT, a novel NP platform enhanced by multi-omics biological networks, advanced algorithms, and knowledge graphs (KGs) featuring network-based and KG-based analytical funct… ▽ More Network pharmacology (NP) explores pharmacological mechanisms through biological networks. Multi-omics data enable multi-layer network construction under diverse conditions, requiring integration into NP analyses. We developed POINT, a novel NP platform enhanced by multi-omics biological networks, advanced algorithms, and knowledge graphs (KGs) featuring network-based and KG-based analytical functions. In the network-based analysis, users can perform NP studies flexibly using 1,158 multi-omics biological networks encompassing proteins, transcription factors, and non-coding RNAs across diverse cell line-, tissue- and disease-specific conditions. Network-based analysis-including random walk with restart (RWR), GSEA, and diffusion profile (DP) similarity algorithms-supports tasks such as target prediction, functional enrichment, and drug screening. We merged networks from experimental sources to generate a pre-integrated multi-layer human network for evaluation. RWR demonstrated superior performance with a 33.1% average ranking improvement over the second-best algorithm, PageRank, in identifying known targets across 2,002 drugs. Additionally, multi-layer networks significantly improve the ability to identify FDA-approved drug-disease pairs compared to the single-layer network. For KG-based analysis, we compiled three high-quality KGs to construct POINT KG, which cross-references over 90% of network-based predictions. We illustrated the platform's capabilities through two case studies. POINT bridges the gap between multi-omics networks and drug discovery; it is freely accessible at http://point.gene.ac/. △ Less

Submitted 10 March, 2025; originally announced March 2025.

Comments: 45 pages. 7 figures

arXiv:2503.06687 [pdf, other]

UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion

Authors: Gongbo Zhang, Yanting Li, Renqian Luo, Pipi Hu, Zeru Zhao, Lingbo Li, Guoqing Liu, Zun Wang, Ran Bi, Kaiyuan Gao, Liya Guo, Yu Xie, Chang Liu, Jia Zhang, Tian Xie, Robert Pinsler, Claudio Zeni, Ziheng Lu, Yingce Xia, Marwin Segler, Maik Riechert, Li Yuan, Lei Chen, Haiguang Liu, Tao Qin

Abstract: Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language ge… ▽ More Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language generation and have been extended to multimodal tasks (e.g., image, video, and audio) using advanced encoders like VQ-VAE to represent complex modalities as discrete sequences. However, their direct application to scientific domains is challenging due to the high precision requirements and the diverse nature of scientific data. On the other hand, diffusion models excel at generating high-dimensional scientific data, such as protein, molecule, and material structures, with remarkable accuracy. Yet, their inability to effectively model sequences limits their potential as general-purpose multimodal foundation models. To address these challenges, we propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. This integration leverages the strengths of autoregressive models to ease the training of conditional diffusion models, while diffusion-based generative heads enhance the precision of autoregressive predictions. We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation. Notably, UniGenX demonstrates significant improvements, especially in handling long sequences for complex structures, showcasing its efficacy as a versatile tool for scientific data generation. △ Less

Submitted 9 March, 2025; originally announced March 2025.

arXiv:2503.06353 [pdf]

The AI Pentad, the CHARME$^{2}$D Model, and an Assessment of Current-State AI Regulation

Authors: Di Kevin Gao, Sudip Mittal, Jiming Wu, Hongwei Du, Jingdao Chen, Shahram Rahimi

Abstract: Artificial Intelligence (AI) has made remarkable progress in the past few years with AI-enabled applications beginning to permeate every aspect of our society. Despite the widespread consensus on the need to regulate AI, there remains a lack of a unified approach to framing, developing, and assessing AI regulations. Many of the existing methods take a value-based approach, for example, accountabil… ▽ More Artificial Intelligence (AI) has made remarkable progress in the past few years with AI-enabled applications beginning to permeate every aspect of our society. Despite the widespread consensus on the need to regulate AI, there remains a lack of a unified approach to framing, developing, and assessing AI regulations. Many of the existing methods take a value-based approach, for example, accountability, fairness, free from bias, transparency, and trust. However, these methods often face challenges at the outset due to disagreements in academia over the subjective nature of these definitions. This paper aims to establish a unifying model for AI regulation from the perspective of core AI components. We first introduce the AI Pentad, which comprises the five essential components of AI: humans and organizations, algorithms, data, computing, and energy. We then review AI regulatory enablers, including AI registration and disclosure, AI monitoring, and AI enforcement mechanisms. Subsequently, we present the CHARME$^{2}$D Model to explore further the relationship between the AI Pentad and AI regulatory enablers. Finally, we apply the CHARME$^{2}$D model to assess AI regulatory efforts in the European Union (EU), China, the United Arab Emirates (UAE), the United Kingdom (UK), and the United States (US), highlighting their strengths, weaknesses, and gaps. This comparative evaluation offers insights for future legislative work in the AI domain. △ Less

Submitted 8 March, 2025; originally announced March 2025.

arXiv:2503.01873 [pdf, other]

Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis

Authors: Long Cheng, Qichen Liao, Fan Wu, Junlin Mu, Tengfei Han, Zhe Qiu, Lianqiang Li, Tianyi Liu, Fangzheng Miao, Keming Gao, Liang Wang, Zhen Zhang, Qiande Yin

Abstract: Attention calculation is extremely time-consuming for long-sequence inference tasks, such as text or image/video generation, in large models. To accelerate this process, we developed a low-precision, mathematically-equivalent algorithm called PASA, based on Flash Attention. PASA introduces two novel techniques: online pseudo-average shifting and global recovering. These techniques enable the use o… ▽ More Attention calculation is extremely time-consuming for long-sequence inference tasks, such as text or image/video generation, in large models. To accelerate this process, we developed a low-precision, mathematically-equivalent algorithm called PASA, based on Flash Attention. PASA introduces two novel techniques: online pseudo-average shifting and global recovering. These techniques enable the use of half-precision computation throughout the Flash Attention process without incurring overflow instability or unacceptable numerical accuracy loss. This algorithm enhances performance on memory-restricted AI hardware architectures, such as the Ascend Neural-network Processing Unit(NPU), by reducing data movement and increasing computational FLOPs. The algorithm is validated using both designed random benchmarks and real large models. We find that the large bias and amplitude of attention input data are critical factors contributing to numerical overflow ($>65504$ for half precision) in two different categories of large models (Qwen2-7B language models and Stable-Video-Diffusion multi-modal models). Specifically, overflow arises due to the large bias in the sequence dimension and the resonance mechanism between the query and key in the head dimension of the Stable-Video-Diffusion models. The resonance mechanism is defined as phase coincidence or 180-degree phase shift between query and key matrices. It will remarkably amplify the element values of attention score matrix. This issue also applies to the Qwen models. Additionally, numerical accuracy is assessed through root mean square error (RMSE) and by comparing the final generated texts and videos to those produced using high-precision attention. △ Less

Submitted 25 February, 2025; originally announced March 2025.

Comments: 21 Pages, 14 figures, conference paper

arXiv:2503.01853 [pdf]

Direct detonation initiation and propagation in methane/air mixtures containing coal particles

Authors: Shengnan Li, Shangpeng Li, Shumeng Xie, Yong Xu, Ke Gao, Huangwei Zhang

Abstract: The mechanisms of direct detonation initiation (DDI) in methane/air mixtures containing coal particles are investigated through simulations conducted using the Eulerian-Lagrangian method in a two-dimensional configuration. Methane-air combustion is modelled with a detailed chemical mechanism involving 36 species and 219 reactions, while coal particle surface reactions are computed using a kinetic/… ▽ More The mechanisms of direct detonation initiation (DDI) in methane/air mixtures containing coal particles are investigated through simulations conducted using the Eulerian-Lagrangian method in a two-dimensional configuration. Methane-air combustion is modelled with a detailed chemical mechanism involving 36 species and 219 reactions, while coal particle surface reactions are computed using a kinetic/diffusion-limited rate model. The findings indicate that shock waves generated from the hotspot can initiate detonation through heterogeneous and homogeneous reactions, with contributions from both methane and particle combustion. Coal particle surface reactions provide the dominant energy for detonation initiation, whereas gas-phase reactions enhance detonation stability during propagation. The difficulty of achieving detonation initiation exhibits a non-linear dependence on particle concentrations and gas equivalence ratios. An optimal particle concentration and gas equivalence ratio for successful DDI is identified. Smaller particles are found to facilitate detonation initiation more effectively. Key processes in DDI of two-phase mixtures are identified, including particle heating, methane combustion, and particle burning. Three DDI modes, critical, stable, and cell-free, are observed based on particle concentration. As particle concentration increases, the temperatures of both particles and gas become close, initially rising and then decreasing with further increases in particle concentration. Additionally, the introduction of coal particles gives rise to two distinct stages in gas-phase reactions. △ Less

Submitted 21 February, 2025; originally announced March 2025.

arXiv:2503.00566 [pdf, ps, other]

Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires

Authors: Kyle Gao, Dening Lu, Liangzhi Li, Nan Chen, Hongjie He, Linlin Xu, Jonathan Li

Abstract: The Los Angeles wildfires of January 2025 caused more than 250 billion dollars in damage and lasted for nearly an entire month before containment. Following our previous work, the Digital Twin Building, we modify and leverage the multi-agent large language model framework as well as the cloud-mapping integration to study the air quality during the Los Angeles wildfires. Recent advances in large la… ▽ More The Los Angeles wildfires of January 2025 caused more than 250 billion dollars in damage and lasted for nearly an entire month before containment. Following our previous work, the Digital Twin Building, we modify and leverage the multi-agent large language model framework as well as the cloud-mapping integration to study the air quality during the Los Angeles wildfires. Recent advances in large language models have allowed for out-of-the-box automated large-scale data analysis. We use a multi-agent large language system comprised of an Instructor agent and Worker agents. Upon receiving the users' instructions, the Instructor agent retrieves the data from the cloud platform and produces instruction prompts to the Worker agents. The Worker agents then analyze the data and provide summaries. The summaries are finally input back into the Instructor agent, which then provides the final data analysis. We test this system's capability for data-based policy recommendation by assessing our Instructor-Worker LLM system's health recommendations based on air quality during the Los Angeles wildfires. △ Less

Submitted 5 June, 2025; v1 submitted 1 March, 2025; originally announced March 2025.

arXiv:2502.18281 [pdf, other]

The Maximum $T_c$ of Conventional Superconductors at Ambient Pressure

Authors: Kun Gao, Tiago F. T. Cerqueira, Antonio Sanna, Yue-Wen Fang, Đorđe Dangić, Ion Errea, Hai-Chen Wang, Silvana Botti, Miguel A. L. Marques

Abstract: The theoretical maximum critical temperature ($T_c$) for conventional superconductors at ambient pressure remains a fundamental question in condensed matter physics. Through analysis of electron-phonon calculations for over 20,000 metals, we critically examine this question. We find that while hydride metals can exhibit maximum phonon frequencies of more than 5000 K, the crucial logarithmic averag… ▽ More The theoretical maximum critical temperature ($T_c$) for conventional superconductors at ambient pressure remains a fundamental question in condensed matter physics. Through analysis of electron-phonon calculations for over 20,000 metals, we critically examine this question. We find that while hydride metals can exhibit maximum phonon frequencies of more than 5000 K, the crucial logarithmic average frequency $ω_\text{log}$ rarely exceeds 1800 K. Our data reveals an inherent trade-off between $ω_\text{log}$ and the electron-phonon coupling constant $λ$, suggesting that the optimal Eliashberg function that maximizes $T_c$ is unphysical. Based on our calculations, we identify Li$_2$AgH$_6$ and its sibling Li$_2$AuH$_6$ as theoretical materials that likely approach the practical limit for conventional superconductivity at ambient pressure. Analysis of thermodynamic stability indicates that compounds with higher predicted $T_c$ values are increasingly unstable, making their synthesis challenging. While fundamental physical laws do not strictly limit $T_c$ to low-temperatures, our analysis suggests that achieving room-temperature conventional superconductivity at ambient pressure is extremely unlikely. △ Less

Submitted 25 February, 2025; originally announced February 2025.

arXiv:2502.15447 [pdf, other]

doi 10.1016/j.xinn.2025.100802

Ultra-high-energy $γ$-ray emission associated with the tail of a bow-shock pulsar wind nebula

Authors: Zhen Cao, F. Aharonian, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, W. Bian, A. V. Bukevich, C. M. Cai, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, H. X. Chen, Liang Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. Chen, S. H. Chen, S. Z. Chen , et al. (274 additional authors not shown)

Abstract: In this study, we present a comprehensive analysis of an unidentified point-like ultra-high-energy (UHE) $γ$-ray source, designated as 1LHAASO J1740+0948u, situated in the vicinity of the middle-aged pulsar PSR J1740+1000. The detection significance reached 17.1$σ$ (9.4$σ$) above 25$\,$TeV (100$\,$TeV). The source energy spectrum extended up to 300$\,$TeV, which was well fitted by a log-parabola f… ▽ More In this study, we present a comprehensive analysis of an unidentified point-like ultra-high-energy (UHE) $γ$-ray source, designated as 1LHAASO J1740+0948u, situated in the vicinity of the middle-aged pulsar PSR J1740+1000. The detection significance reached 17.1$σ$ (9.4$σ$) above 25$\,$TeV (100$\,$TeV). The source energy spectrum extended up to 300$\,$TeV, which was well fitted by a log-parabola function with $N0 = (1.93\pm0.23) \times 10^{-16} \rm{TeV^{-1}\,cm^{-2}\,s^{-2}}$, $α= 2.14\pm0.27$, and $β= 1.20\pm0.41$ at E0 = 30$\,$TeV. The associated pulsar, PSR J1740+1000, resides at a high galactic latitude and powers a bow-shock pulsar wind nebula (BSPWN) with an extended X-ray tail. The best-fit position of the gamma-ray source appeared to be shifted by $0.2^{\circ}$ with respect to the pulsar position. As the (i) currently identified pulsar halos do not demonstrate such offsets, and (ii) centroid of the gamma-ray emission is approximately located at the extension of the X-ray tail, we speculate that the UHE $γ$-ray emission may originate from re-accelerated electron/positron pairs that are advected away in the bow-shock tail. △ Less

Submitted 24 February, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

Comments: Corrected spelling errors in several author names

Journal ref: The Innovation (2025), 100802

arXiv:2502.14934 [pdf, other]

Fast and Accurate Blind Flexible Docking

Authors: Zizhuo Zhang, Lijun Wu, Kaiyuan Gao, Jiangchao Yao, Tao Qin, Bo Han

Abstract: Molecular docking that predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To addre… ▽ More Molecular docking that predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking-pocket identification, ligand conformation prediction, and protein flexibility modeling-into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208 $\times$) compared to existing state-of-the-art methods. Our code is released at https://github.com/tmlr-group/FABFlex. △ Less

Submitted 20 February, 2025; originally announced February 2025.

Comments: 25 pages, Accepted by ICLR 2025

arXiv:2502.11184 [pdf, ps, other]

Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs

Authors: Wenxuan Wang, Xiaoyuan Liu, Kuiyi Gao, Jen-tse Huang, Youliang Yuan, Pinjia He, Shuai Wang, Zhaopeng Tu

Abstract: Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. However, ensuring the safety of these models remains a significant challenge, particularly in accurately identifying whether multimodal content is safe or unsafe-a capability we term safety awareness. In this paper, we introduce MMSafeAware, t… ▽ More Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. However, ensuring the safety of these models remains a significant challenge, particularly in accurately identifying whether multimodal content is safe or unsafe-a capability we term safety awareness. In this paper, we introduce MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios with 1500 carefully curated image-prompt pairs. MMSafeAware includes both unsafe and over-safety subsets to assess models abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness. Evaluating nine widely used MLLMs using MMSafeAware reveals that current models are not sufficiently safe and often overly sensitive; for example, GPT-4V misclassifies 36.1% of unsafe inputs as safe and 59.9% of benign inputs as unsafe. We further explore three methods to improve safety awareness-prompting-based approaches, visual contrastive decoding, and vision-centric reasoning fine-tuning-but find that none achieve satisfactory performance. Our findings highlight the profound challenges in developing MLLMs with robust safety awareness, underscoring the need for further research in this area. All the code and data will be publicly available to facilitate future research. △ Less

Submitted 3 June, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

Comments: Accepted by ACL 2025

arXiv:2502.09723 [pdf, other]

QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language

Authors: Qingsong Zou, Jingyu Xiao, Qing Li, Zhi Yan, Yuhang Wang, Li Xu, Wenxuan Wang, Kuofeng Gao, Ruoyu Li, Yong Jiang

Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment are developed for defense, prior researches reveal the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propo… ▽ More Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment are developed for defense, prior researches reveal the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into structured non-natural query language to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack not only can achieve high attack success rates (ASRs), but also can jailbreak various defense methods. Furthermore, we tailor a defense method against QueryAttack, which can reduce ASR by up to $64\%$ on GPT-4-1106. Our code is available at https://github.com/horizonsinzqs/QueryAttack. △ Less

Submitted 26 May, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

Comments: To appear in ACL 2025

arXiv:2502.07527 [pdf, ps, other]

Nature Language Model: Deciphering the Language of Nature for Scientific Discovery

Authors: Yingce Xia, Peiran Jin, Shufang Xie, Liang He, Chuan Cao, Renqian Luo, Guoqing Liu, Yue Wang, Zequn Liu, Yuan-Jyue Chen, Zekun Guo, Yeqi Bai, Pan Deng, Yaosen Min, Ziheng Lu, Hongxia Hao, Han Yang, Jielan Li, Chang Liu, Jia Zhang, Jianwei Zhu, Ran Bi, Kehan Wu, Wei Zhang, Kaiyuan Gao , et al. (21 additional authors not shown)

Abstract: Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, RNA and even cells. However, these models… ▽ More Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, RNA and even cells. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) top performance across different domains, matching or surpassing state-of-the-art specialist models. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases. △ Less

Submitted 20 June, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: 95 pages

arXiv:2502.06802 [pdf, other]

Solving the Content Gap in Roblox Game Recommendations: LLM-Based Profile Generation and Reranking

Authors: Chen Wang, Xiaokai Wei, Yexi Jiang, Frank Ong, Kevin Gao, Xiao Yu, Zheng Hui, Se-eun Yoon, Philip Yu, Michelle Gong

Abstract: With the vast and dynamic user-generated content on Roblox, creating effective game recommendations requires a deep understanding of game content. Traditional recommendation models struggle with the inconsistent and sparse nature of game text features such as titles and descriptions. Recent advancements in large language models (LLMs) offer opportunities to enhance recommendation systems by analyz… ▽ More With the vast and dynamic user-generated content on Roblox, creating effective game recommendations requires a deep understanding of game content. Traditional recommendation models struggle with the inconsistent and sparse nature of game text features such as titles and descriptions. Recent advancements in large language models (LLMs) offer opportunities to enhance recommendation systems by analyzing in-game text data. This paper addresses two challenges: generating high-quality, structured text features for games without extensive human annotation, and validating these features to ensure they improve recommendation relevance. We propose an approach that extracts in-game text and uses LLMs to infer attributes such as genre and gameplay objectives from raw player interactions. Additionally, we introduce an LLM-based re-ranking mechanism to assess the effectiveness of the generated text features, enhancing personalization and user satisfaction. Beyond recommendations, our approach supports applications such as user engagement-based integrity detection, already deployed in production. This scalable framework demonstrates the potential of in-game text understanding to improve recommendation quality on Roblox and adapt recommendations to its unique, user-generated ecosystem. △ Less

Submitted 1 February, 2025; originally announced February 2025.

arXiv:2502.05822 [pdf, other]

HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads

Authors: Guobing Gan, Kaiming Gao, Li Wang, Shen Jiang, Peng Jiang

Abstract: Search advertising is essential for merchants to reach the target users on short video platforms. Short video ads aligned with user search intents are displayed through relevance matching and bid ranking mechanisms. This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems. Recent vision-language pre-training models have demonstrated pr… ▽ More Search advertising is essential for merchants to reach the target users on short video platforms. Short video ads aligned with user search intents are displayed through relevance matching and bid ranking mechanisms. This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems. Recent vision-language pre-training models have demonstrated promise in various multimodal tasks. However, their contribution to downstream query-video relevance tasks is limited, as the alignment between the pair of visual signals and text differs from the modeling of the triplet of the query, visual signals, and video text. In addition, our previous relevance model provides limited ranking capabilities, largely due to the discrepancy between the binary cross-entropy fine-tuning objective and the ranking objective. To address these limitations, we design a high-consistency multimodal relevance model (HCMRM). It utilizes a simple yet effective method to enhance the consistency between pre-training and relevance tasks. Specifically, during the pre-training phase, along with aligning visual signals and video text, several keywords are extracted from the video text as pseudo-queries to perform the triplet relevance modeling. For the fine-tuning phase, we introduce a hierarchical softmax loss, which enables the model to learn the order within labels while maximizing the distinction between positive and negative samples. This promotes the fusion ranking of relevance and bidding in the subsequent ranking stage. The proposed method has been deployed in the Kuaishou search advertising system for over a year, contributing to a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue. △ Less

Submitted 9 February, 2025; originally announced February 2025.

Comments: Accepted by WWW 2025 (Industry Track)

arXiv:2502.05769 [pdf, other]

Digital Twin Buildings: 3D Modeling, GIS Integration, and Visual Descriptions Using Gaussian Splatting, ChatGPT/Deepseek, and Google Maps Platform

Authors: Kyle Gao, Dening Lu, Liangzhi Li, Nan Chen, Hongjie He, Linlin Xu, Jonathan Li

Abstract: Urban digital twins are virtual replicas of cities that use multi-source data and data analytics to optimize urban planning, infrastructure management, and decision-making. Towards this, we propose a framework focused on the single-building scale. By connecting to cloud mapping platforms such as Google Map Platforms APIs, by leveraging state-of-the-art multi-agent Large Language Models data analys… ▽ More Urban digital twins are virtual replicas of cities that use multi-source data and data analytics to optimize urban planning, infrastructure management, and decision-making. Towards this, we propose a framework focused on the single-building scale. By connecting to cloud mapping platforms such as Google Map Platforms APIs, by leveraging state-of-the-art multi-agent Large Language Models data analysis using ChatGPT(4o) and Deepseek-V3/R1, and by using our Gaussian Splatting-based mesh extraction pipeline, our Digital Twin Buildings framework can retrieve a building's 3D model, visual descriptions, and achieve cloud-based mapping integration with large language model-based data analytics using a building's address, postal code, or geographic coordinates. △ Less

Submitted 20 April, 2025; v1 submitted 8 February, 2025; originally announced February 2025.

Comments: -Fixed minor typo

arXiv:2502.04848 [pdf, other]

Broadband $γ$-ray spectrum of supernova remnant Cassiopeia A

Authors: Zhen Cao, F. Aharonian, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, W. Bian, A. V. Bukevich, C. M. Cai, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, H. X. Chen, Liang Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. Chen, S. H. Chen, S. Z. Chen , et al. (293 additional authors not shown)

Abstract: The core-collapse supernova remnant (SNR) Cassiopeia A (Cas A) is one of the brightest galactic radio sources with an angular radius of $\sim$ 2.5 $\arcmin$. Although no extension of this source has been detected in the $γ$-ray band, using more than 1000 days of LHAASO data above $\sim 0.8$ TeV, we find that its spectrum is significantly softer than those obtained with Imaging Air Cherenkov Telesc… ▽ More The core-collapse supernova remnant (SNR) Cassiopeia A (Cas A) is one of the brightest galactic radio sources with an angular radius of $\sim$ 2.5 $\arcmin$. Although no extension of this source has been detected in the $γ$-ray band, using more than 1000 days of LHAASO data above $\sim 0.8$ TeV, we find that its spectrum is significantly softer than those obtained with Imaging Air Cherenkov Telescopes (IACTs) and its flux near $\sim 1$ TeV is about two times higher. In combination with analyses of more than 16 years of \textit{Fermi}-LAT data covering $0.1 \, \mathrm{GeV} - 1 \, \mathrm{TeV}$, we find that the spectrum above 30 GeV deviates significantly from a single power-law, and is best described by a smoothly broken power-law with a spectral index of $1.90 \pm 0.15_\mathrm{stat}$ ($3.41 \pm 0.19_\mathrm{stat}$) below (above) a break energy of $0.63 \pm 0.21_\mathrm{stat} \, \mathrm{TeV}$. Given differences in the angular resolution of LHAASO-WCDA and IACTs, TeV $γ$-ray emission detected with LHAASO may have a significant contribution from regions surrounding the SNR illuminated by particles accelerated earlier, which, however, are treated as background by IACTs. Detailed modelling can be used to constrain acceleration processes of TeV particles in the early stage of SNR evolution. △ Less

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.02753 [pdf, other]

MuST: Multi-Head Skill Transformer for Long-Horizon Dexterous Manipulation with Skill Progress

Authors: Kai Gao, Fan Wang, Erica Aduh, Dylan Randle, Jane Shi

Abstract: Robot picking and packing tasks require dexterous manipulation skills, such as rearranging objects to establish a good grasping pose, or placing and pushing items to achieve tight packing. These tasks are challenging for robots due to the complexity and variability of the required actions. To tackle the difficulty of learning and executing long-horizon tasks, we propose a novel framework called th… ▽ More Robot picking and packing tasks require dexterous manipulation skills, such as rearranging objects to establish a good grasping pose, or placing and pushing items to achieve tight packing. These tasks are challenging for robots due to the complexity and variability of the required actions. To tackle the difficulty of learning and executing long-horizon tasks, we propose a novel framework called the Multi-Head Skill Transformer (MuST). This model is designed to learn and sequentially chain together multiple motion primitives (skills), enabling robots to perform complex sequences of actions effectively. MuST introduces a "progress value" for each skill, guiding the robot on which skill to execute next and ensuring smooth transitions between skills. Additionally, our model is capable of expanding its skill set and managing various sequences of sub-tasks efficiently. Extensive experiments in both simulated and real-world environments demonstrate that MuST significantly enhances the robot's ability to perform long-horizon dexterous manipulation tasks. △ Less

Submitted 4 February, 2025; originally announced February 2025.

Comments: Accepted by ICRA 2025 (2025 IEEE International Conference on Robotics & Automation)

arXiv:2502.02493 [pdf, ps, other]

EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

Authors: Yize Wu, Ke Gao, Yanjun Wu

Abstract: Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU systems, inference latency can be further reduced through tensor parallelism (TP), while the optimal TP size of the draft model is typically smaller than that of… ▽ More Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU systems, inference latency can be further reduced through tensor parallelism (TP), while the optimal TP size of the draft model is typically smaller than that of the base model, leading to GPU idling during the drafting stage. To solve this problem, we propose EasySpec, a layer-parallel speculation strategy that optimizes the efficiency of multi-GPU utilization.EasySpec breaks the sequential execution order of layers in the drafting model, enabling multi-layer parallelization across devices, albeit with some induced approximation errors. After each drafting-and-verification iteration, the draft model's key-value (KV) cache is calibrated in a single forward pass, preventing long-term error accumulation at minimal additional latency. We evaluated EasySpec on several mainstream open-source LLMs, using smaller versions of models from the same series as drafters. The results demonstrate that EasySpec can achieve a peak speedup of 4.17x compared to vanilla decoding, while preserving the original distribution of the base LLMs. Specifically, the drafting stage can be accelerated by up to 1.62x with a maximum accuracy drop of only 7%, requiring no training or fine-tuning on the draft models. △ Less

Submitted 4 February, 2025; originally announced February 2025.

MSC Class: I.2.11

arXiv:2501.19143 [pdf, other]

Imitation Game for Adversarial Disillusion with Multimodal Generative Chain-of-Thought Role-Play

Authors: Ching-Chun Chang, Fan-Yun Chen, Shih-Hong Gu, Kai Gao, Hanrui Wang, Isao Echizen

Abstract: As the cornerstone of artificial intelligence, machine perception confronts a fundamental threat posed by adversarial illusions. These adversarial attacks manifest in two primary forms: deductive illusion, where specific stimuli are crafted based on the victim model's general decision logic, and inductive illusion, where the victim model's general decision logic is shaped by specific stimuli. The… ▽ More As the cornerstone of artificial intelligence, machine perception confronts a fundamental threat posed by adversarial illusions. These adversarial attacks manifest in two primary forms: deductive illusion, where specific stimuli are crafted based on the victim model's general decision logic, and inductive illusion, where the victim model's general decision logic is shaped by specific stimuli. The former exploits the model's decision boundaries to create a stimulus that, when applied, interferes with its decision-making process. The latter reinforces a conditioned reflex in the model, embedding a backdoor during its learning phase that, when triggered by a stimulus, causes aberrant behaviours. The multifaceted nature of adversarial illusions calls for a unified defence framework, addressing vulnerabilities across various forms of attack. In this study, we propose a disillusion paradigm based on the concept of an imitation game. At the heart of the imitation game lies a multimodal generative agent, steered by chain-of-thought reasoning, which observes, internalises and reconstructs the semantic essence of a sample, liberated from the classic pursuit of reversing the sample to its original state. As a proof of concept, we conduct experimental simulations using a multimodal generative dialogue agent and evaluates the methodology under a variety of attack scenarios. △ Less

Submitted 31 January, 2025; originally announced January 2025.

arXiv:2501.16663 [pdf, other]

Data Duplication: A Novel Multi-Purpose Attack Paradigm in Machine Unlearning

Authors: Dayong Ye, Tianqing Zhu, Jiayang Li, Kun Gao, Bo Liu, Leo Yu Zhang, Wanlei Zhou, Yang Zhang

Abstract: Duplication is a prevalent issue within datasets. Existing research has demonstrated that the presence of duplicated data in training datasets can significantly influence both model performance and data privacy. However, the impact of data duplication on the unlearning process remains largely unexplored. This paper addresses this gap by pioneering a comprehensive investigation into the role of dat… ▽ More Duplication is a prevalent issue within datasets. Existing research has demonstrated that the presence of duplicated data in training datasets can significantly influence both model performance and data privacy. However, the impact of data duplication on the unlearning process remains largely unexplored. This paper addresses this gap by pioneering a comprehensive investigation into the role of data duplication, not only in standard machine unlearning but also in federated and reinforcement unlearning paradigms. Specifically, we propose an adversary who duplicates a subset of the target model's training set and incorporates it into the training set. After training, the adversary requests the model owner to unlearn this duplicated subset, and analyzes the impact on the unlearned model. For example, the adversary can challenge the model owner by revealing that, despite efforts to unlearn it, the influence of the duplicated subset remains in the model. Moreover, to circumvent detection by de-duplication techniques, we propose three novel near-duplication methods for the adversary, each tailored to a specific unlearning paradigm. We then examine their impacts on the unlearning process when de-duplication techniques are applied. Our findings reveal several crucial insights: 1) the gold standard unlearning method, retraining from scratch, fails to effectively conduct unlearning under certain conditions; 2) unlearning duplicated data can lead to significant model degradation in specific scenarios; and 3) meticulously crafted duplicates can evade detection by de-duplication methods. △ Less

Submitted 11 March, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

Comments: Accepted at USENIX Security 2025

arXiv:2501.15396 [pdf]

doi 10.1038/s41746-025-01507-3

Foundations of a Knee Joint Digital Twin from qMRI Biomarkers for Osteoarthritis and Knee Replacement

Authors: Gabrielle Hoyer, Kenneth T Gao, Felix G Gassert, Johanna Luitjens, Fei Jiang, Sharmila Majumdar, Valentina Pedoia

Abstract: This study forms the basis of a digital twin system of the knee joint, using advanced quantitative MRI (qMRI) and machine learning to advance precision health in osteoarthritis (OA) management and knee replacement (KR) prediction. We combined deep learning-based segmentation of knee joint structures with dimensionality reduction to create an embedded feature space of imaging biomarkers. Through cr… ▽ More This study forms the basis of a digital twin system of the knee joint, using advanced quantitative MRI (qMRI) and machine learning to advance precision health in osteoarthritis (OA) management and knee replacement (KR) prediction. We combined deep learning-based segmentation of knee joint structures with dimensionality reduction to create an embedded feature space of imaging biomarkers. Through cross-sectional cohort analysis and statistical modeling, we identified specific biomarkers, including variations in cartilage thickness and medial meniscus shape, that are significantly associated with OA incidence and KR outcomes. Integrating these findings into a comprehensive framework represents a considerable step toward personalized knee-joint digital twins, which could enhance therapeutic strategies and inform clinical decision-making in rheumatological care. This versatile and reliable infrastructure has the potential to be extended to broader clinical applications in precision health. △ Less

Submitted 25 January, 2025; originally announced January 2025.

Comments: This manuscript builds on an earlier preprint version available on Research Square: https://doi.org/10.21203/rs.3.rs-4317958/v1

Journal ref: npj Digit. Med. 8, 118 (2025)

arXiv:2501.13340 [pdf, other]

Retrievals Can Be Detrimental: A Contrastive Backdoor Attack Paradigm on Retrieval-Augmented Diffusion Models

Authors: Hao Fang, Xiaohang Sui, Hongyao Yu, Kuofeng Gao, Jiawei Kong, Sijin Yu, Bin Chen, Hao Wu, Shu-Tao Xia

Abstract: Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary dat… ▽ More Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models' generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG's characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate the proposed BadRDM achieves outstanding attack effects while preserving the model's benign utility. △ Less

Submitted 9 March, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

arXiv:2501.12948 [pdf, other]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu , et al. (175 additional authors not shown)

Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters… ▽ More We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. △ Less

Submitted 22 January, 2025; originally announced January 2025.

arXiv:2501.01843 [pdf, other]

A stable phase-locking-free single beam optical lattice with multiple configurations

Authors: Yirong Wang, Xiaoyu Dai, Xue Zhao, Guangren Sun, Kuiyi Gao, Wei Zhang

Abstract: Optical lattices formed by interfering laser beams are widely used to trap and manipulate atoms for quantum simulation, metrology, and computation. To stabilize optical lattices in experiments, it is usually challenging to implement delicate phase-locking systems with complicated optics and electronics to reduce the relative phase fluctuation of multiple laser beams. Here we report a phase-locking… ▽ More Optical lattices formed by interfering laser beams are widely used to trap and manipulate atoms for quantum simulation, metrology, and computation. To stabilize optical lattices in experiments, it is usually challenging to implement delicate phase-locking systems with complicated optics and electronics to reduce the relative phase fluctuation of multiple laser beams. Here we report a phase-locking-free scheme to implement optical lattices by passing a single laser beam through a prism with n-fold symmetric facets and large apex angles. The scheme ensures a stable optical lattice since the interference occurs among different deflected parts of a single laser beam without any moving component. Various lattice configurations, including a triangular lattice and a quasi-crystal lattice with ten-fold symmetry are demonstrated. In both cases, stability measurements show a change of lattice constant in less than 1.14%, and a drift of lattice position in less than 1.61%. △ Less

Submitted 3 January, 2025; originally announced January 2025.

arXiv:2501.00625 [pdf, ps, other]

Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting

Authors: Kyle Gao, Liangzhi Li, Hongjie He, Dening Lu, Linlin Xu, Jonathan Li

Abstract: Recently released open-source pre-trained foundational image segmentation and object detection models (SAM2+GroundingDINO) allow for geometrically consistent segmentation of objects of interest in multi-view 2D images. Users can use text-based or click-based prompts to segment objects of interest without requiring labeled training datasets. Gaussian Splatting allows for the learning of the 3D repr… ▽ More Recently released open-source pre-trained foundational image segmentation and object detection models (SAM2+GroundingDINO) allow for geometrically consistent segmentation of objects of interest in multi-view 2D images. Users can use text-based or click-based prompts to segment objects of interest without requiring labeled training datasets. Gaussian Splatting allows for the learning of the 3D representation of a scene's geometry and radiance based on 2D images. Combining Google Earth Studio, SAM2+GroundingDINO, 2D Gaussian Splatting, and our improvements in mask refinement based on morphological operations and contour simplification, we created a pipeline to extract the 3D mesh of any building based on its name, address, or geographic coordinates. △ Less

Submitted 5 June, 2025; v1 submitted 31 December, 2024; originally announced January 2025.

arXiv:2412.19437 [pdf, other]

DeepSeek-V3 Technical Report

Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao , et al. (175 additional authors not shown)

Abstract: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for loa… ▽ More We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3. △ Less

Submitted 18 February, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

arXiv:2412.15398 [pdf, other]

Tabletop Object Rearrangement: Structure, Complexity, and Efficient Combinatorial Search-Based Solutions

Authors: Kai Gao

Abstract: This thesis provides an in-depth structural analysis and efficient algorithmic solutions for tabletop object rearrangement with overhand grasps (TORO), a foundational task in advancing intelligent robotic manipulation. Rearranging multiple objects in a confined workspace presents two primary challenges: sequencing actions to minimize pick-and-place operations - an NP-hard problem in TORO - and det… ▽ More This thesis provides an in-depth structural analysis and efficient algorithmic solutions for tabletop object rearrangement with overhand grasps (TORO), a foundational task in advancing intelligent robotic manipulation. Rearranging multiple objects in a confined workspace presents two primary challenges: sequencing actions to minimize pick-and-place operations - an NP-hard problem in TORO - and determining temporary object placements ("buffer poses") within a cluttered environment, which is essential yet highly complex. For TORO with available external free space, this work investigates the minimum buffer space, or "running buffer size," required for temporary relocations, presenting both theoretical insights and exact algorithms. For TORO without external free space, the concept of lazy buffer verification is introduced, with its efficiency evaluated across various manipulator configurations, including single-arm, dual-arm, and mobile manipulators. △ Less

Submitted 31 January, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

Comments: PhD Thesis of Kai Gao, written under the direction of Prof. Jingjin Yu

arXiv:2412.12453 [pdf, other]

Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding

Authors: Hanlei Zhang, Qianrui Zhou, Hua Xu, Jianhua Su, Roberto Evans, Kai Gao

Abstract: Multimodal intent understanding is a significant research area that requires effective leveraging of multiple modalities to analyze human language. Existing methods face two main challenges in this domain. Firstly, they have limitations in capturing the nuanced and high-level semantics underlying complex in-distribution (ID) multimodal intents. Secondly, they exhibit poor generalization when confr… ▽ More Multimodal intent understanding is a significant research area that requires effective leveraging of multiple modalities to analyze human language. Existing methods face two main challenges in this domain. Firstly, they have limitations in capturing the nuanced and high-level semantics underlying complex in-distribution (ID) multimodal intents. Secondly, they exhibit poor generalization when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations for both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and engage in multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between ID and OOD binary classes, while the fine-grained perspective not only enhances the discrimination between different ID classes but also captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance with a 3~10% increase in AUROC scores while achieving new state-of-the-art results in ID classification. Data and codes are available at https://github.com/thuiar/MIntOOD. △ Less

Submitted 23 May, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

Comments: Accepted by IEEE Transactions on Multimedia

Showing 1–50 of 373 results for author: Gao, K