-
Loki's Dance of Illusions: A Comprehensive Survey of Hallucination in Large Language Models
Authors:
Chaozhuo Li,
Pengbo Wang,
Chenxu Wang,
Litian Zhang,
Zheng Liu,
Qiwei Ye,
Yuanbo Xu,
Feiran Huang,
Xi Zhang,
Philip S. Yu
Abstract:
Edgar Allan Poe noted, "Truth often lurks in the shadow of error," highlighting the deep complexity intrinsic to the interplay between truth and falsehood, notably under conditions of cognitive and informational asymmetry. This dynamic is strikingly evident in large language models (LLMs). Despite their impressive linguistic generation capabilities, LLMs sometimes produce information that appears factually accurate but is, in reality, fabricated, an issue often referred to as 'hallucinations'. The prevalence of these hallucinations can mislead users, affecting their judgments and decisions. In sectors such as finance, law, and healthcare, such misinformation risks causing substantial economic losses, legal disputes, and health risks, with wide-ranging consequences. In our research, we methodically categorize LLM hallucinations and analyze their causes, detection methods, and solutions. Our efforts have particularly focused on understanding the roots of hallucinations and evaluating the efficacy of current strategies in revealing the underlying logic, thereby paving the way for the development of innovative and potent approaches. By examining why certain measures are effective against hallucinations, our study aims to foster a comprehensive approach to tackling this issue within the domain of LLMs.
Submitted 6 June, 2025;
originally announced July 2025.
-
Discovery and Preliminary Characterization of a Third Interstellar Object: 3I/ATLAS
Authors:
Darryl Z. Seligman,
Marco Micheli,
Davide Farnocchia,
Larry Denneau,
John W. Noonan,
Henry H. Hsieh,
Toni Santana-Ros,
John Tonry,
Katie Auchettl,
Luca Conversi,
Maxime Devogèle,
Laura Faggioli,
Adina D. Feinstein,
Marco Fenucci,
Marin Ferrais,
Tessa Frincke,
Olivier R. Hainaut,
Kyle Hart,
Andrew Hoffman,
Carrie E. Holt,
Willem B. Hoogendam,
Mark E. Huber,
Emmanuel Jehin,
Theodore Kareta,
Jacqueline V. Keane
, et al. (20 additional authors not shown)
Abstract:
We report initial observations aimed at the characterization of a third interstellar object candidate. This object, 3I/ATLAS or C/2025 N1 (ATLAS), was discovered on 2025 July 1 UT and has an orbital eccentricity of $e\sim6.1$, perihelion of $q\sim 1.36$ au, inclination of $\sim175^\circ$, and hyperbolic velocity of $V_\infty\sim 58$ km s$^{-1}$. We report deep stacked images obtained using the Canada-France-Hawaii Telescope and the Very Large Telescope that resolve a compact coma. Using images obtained from several smaller ground-based telescopes, we find minimal light curve variation for the object over a $\sim4$ day time span. The visible/near-infrared spectral slope of the object is 17.1$\pm$0.2 %/100 nm, comparable to other interstellar objects and primitive solar system small bodies (comets and D-type asteroids), although this result is likely affected by some coma contamination. 3I/ATLAS will be observable through early September 2025, then unobservable by Earth-based observatories near perihelion due to low solar elongation. It will be observable again from the ground in late November 2025. Although this limitation unfortunately prohibits detailed Earth-based observations at perihelion when the activity of 3I/ATLAS is likely to peak, spacecraft at Mars could be used to make valuable observations at this time. Additional photometric, spectroscopic, and polarimetric monitoring of 3I/ATLAS by ground- and space-based telescopes, and possibly spacecraft based at Mars, is highly encouraged for characterizing 3I/ATLAS's rotational light curve, activity evolution, nongravitational acceleration, and compositional indicators of formation conditions.
Submitted 7 July, 2025; v1 submitted 3 July, 2025;
originally announced July 2025.
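As a rough consistency check on the orbital elements quoted in the 3I/ATLAS abstract above, the hyperbolic excess speed can be recovered from the eccentricity and perihelion distance using $|a| = q/(e-1)$ and $V_\infty = \sqrt{GM_\odot/|a|}$. The sketch below assumes a pure two-body heliocentric orbit and the rounded values $e \approx 6.1$ and $q \approx 1.36$ au; the function name and constants are illustrative and are not taken from the paper.

```python
import math

# Assumed constants (standard values, not taken from the abstract).
GM_SUN = 1.32712440018e20   # heliocentric gravitational parameter, m^3 s^-2
AU = 1.495978707e11         # astronomical unit, m


def hyperbolic_excess_speed(eccentricity, perihelion_au):
    """Return V_infinity in km/s for a hyperbolic two-body orbit.

    Uses |a| = q / (e - 1) and V_inf = sqrt(GM / |a|).
    """
    if eccentricity <= 1.0:
        raise ValueError("a hyperbolic orbit requires eccentricity > 1")
    semi_major_axis_m = perihelion_au * AU / (eccentricity - 1.0)
    return math.sqrt(GM_SUN / semi_major_axis_m) / 1000.0


# Rounded elements quoted in the abstract.
v_inf = hyperbolic_excess_speed(6.1, 1.36)
print(f"V_inf is roughly {v_inf:.1f} km/s")
```

With the rounded elements, the result lands close to the quoted value of about 58 km s$^{-1}$, which suggests the listed numbers are mutually consistent under the two-body assumption.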
-
DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning
Authors:
Hang Wu,
Hongkai Chen,
Yujun Cai,
Chang Liu,
Qingwen Ye,
Ming-Hsuan Yang,
Yiwei Wang
Abstract:
Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model's initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
Submitted 11 June, 2025;
originally announced July 2025.
-
Structured Attention Matters to Multimodal LLMs in Document Understanding
Authors:
Chang Liu,
Hongkai Chen,
Yujun Cai,
Hang Wu,
Qingwen Ye,
Ming-Hsuan Yang,
Yiwei Wang
Abstract:
Document understanding remains a significant challenge for multimodal large language models (MLLMs). While previous research has primarily focused on locating evidence pages through precise multimodal queries, our work investigates a fundamental yet overlooked aspect: how input format influences document comprehension performance. Through systematic analysis, we discover that raw OCR text often impairs rather than improves MLLMs' performance, which is a counterintuitive finding we attribute to attention dispersion and structure loss. To further substantiate our hypothesis, we propose a novel structure-preserving approach that encodes document elements using the LaTeX paradigm, maintaining the hierarchical organization and spatial relationships critical for comprehension. Our attention analysis reveals that structured text induces structured attention patterns on both textual and visual content, directing models to focus on semantically meaningful regions while reducing attention waste. This approach significantly enhances MLLMs' document question answering performance across diverse document types without requiring architectural modifications or additional training.
Submitted 19 June, 2025;
originally announced June 2025.
-
A Large Outburst, Coma Asymmetries, and the Color of Comet 243P/NEAT
Authors:
Michael S. P. Kelley,
Silvia Protopapa,
Dennis Bodewits,
Aren N. Heinze,
Youssef Moulane,
Quanzhi Ye,
Bryce Bolin,
Simon Conseil,
Tony L. Farnham,
Lori Feaga,
Xing Gao,
Chih-Hao Hsia,
Emmanuel Jehin,
Shrinivas R. Kulkarni,
Russ R. Laher,
Tim Lister,
Frank J. Masci,
Josiah Purdum,
Bin Yang
Abstract:
Water ice is a fundamental building material of comets and other bodies in the outer solar system. Yet, the properties of cometary water ice are challenging to study, due to its volatility and the typical distances at which comets are observed. Cometary outbursts, impulsive mass-loss events that can liberate large amounts of material, offer opportunities to directly observe and characterize cometary water ice. We present a study of comet 243P/NEAT, instigated by a $-3$ mag outburst that occurred in December 2018. Optical images and a 251-day lightcurve were examined to characterize the outburst and the comet's quiescent activity. Variations in the quiescent lightcurve appear to be dominated by coma asymmetries, rather than changing activity levels as the comet approached and receded from the Sun. Furthermore, the lightcurve shows evidence for 1 to 2 additional small outbursts ($-0.3$ mag) occurring in September 2018. The large December 2018 outburst likely ejected water ice grains, yet no signatures of ice were found in color photometry, a color map, or a near-infrared spectrum. We discuss possible dynamical and thermal reasons for this non-detection. In this context, we examined the comae of comets 103P/Hartley 2 and C/2013 US$_{10}$ (Catalina), and show that a one-to-one mapping between continuum color and the presence of water ice cannot be supported. We also discuss possible causes for the large outburst, and find that there is an apparent grouping in the kinetic energy per mass estimates for the outbursts of 5 comets.
Submitted 23 June, 2025;
originally announced June 2025.
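To put the outburst amplitudes quoted in the 243P/NEAT abstract above into linear terms, a magnitude change can be converted to a brightness ratio with the standard Pogson relation, $F_2/F_1 = 10^{-0.4\,\Delta m}$. The helper below is a minimal sketch; its name is hypothetical, and it says nothing about how the authors measured the outbursts.

```python
def flux_ratio(delta_mag):
    """Brightness ratio F2/F1 for a magnitude change delta_mag (Pogson relation).

    Negative delta_mag means the object brightened.
    """
    return 10.0 ** (-0.4 * delta_mag)


large_outburst = flux_ratio(-3.0)   # the December 2018 event
small_outburst = flux_ratio(-0.3)   # the candidate September 2018 events
print(f"-3 mag   -> about {large_outburst:.1f}x brighter")
print(f"-0.3 mag -> about {small_outburst:.2f}x brighter")
```

Under this relation, a $-3$ mag outburst corresponds to a brightening by a factor of roughly 16, while a $-0.3$ mag event is only about a 30% increase.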
-
ARSAR-Net: Intelligent SAR Imaging with Adaptive Regularization
Authors:
Shiping Fu,
Yufan Chen,
Zhe Zhang,
Xiaolan Qiu,
Qixiang Ye
Abstract:
Deep unfolding networks have recently emerged as a promising approach for synthetic aperture radar (SAR) imaging. However, baseline unfolding networks, typically derived from iterative reconstruction algorithms such as the alternating direction method of multipliers (ADMM), lack generalization capability across scenes, primarily because their regularizers are empirically designed rather than learned from data. In this study, we introduce a learnable regularizer into the unfolding network and propose a SAR imaging network with adaptive regularization (ARSAR-Net), which aims to generalize across heterogeneous scenes including offshore ships, islands, urban areas, and mountainous terrain. Furthermore, two variants of ARSAR-Net are developed, targeting improved imaging efficiency and reconstruction quality, respectively. Extensive validation through simulated and real-data experiments demonstrates three key advantages of ARSAR-Net: (1) a 50% increase in imaging speed over existing unfolding networks, (2) a PSNR gain of up to 2.0 dB in imaging quality, and (3) enhanced adaptability to complex scenes. These advancements establish a new paradigm for computationally efficient and generalizable SAR imaging systems.
Submitted 26 June, 2025; v1 submitted 23 June, 2025;
originally announced June 2025.
-
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
Authors:
Yongqi Fan,
Yating Wang,
Guandong Wang,
Jie Zhai,
Jingping Liu,
Qi Ye,
Tong Ruan
Abstract:
Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.
Submitted 18 June, 2025;
originally announced June 2025.
-
Human Locomotion Implicit Modeling Based Real-Time Gait Phase Estimation
Authors:
Yuanlong Ji,
Xingbang Yang,
Ruoqi Zhao,
Qihan Ye,
Quan Zheng,
Yubo Fan
Abstract:
Gait phase estimation based on inertial measurement unit (IMU) signals facilitates precise adaptation of exoskeletons to individual gait variations. However, challenges remain in achieving high accuracy and robustness, particularly during periods of terrain changes. To address this, we develop a gait phase estimation neural network based on implicit modeling of human locomotion, which combines temporal convolution for feature extraction with transformer layers for multi-channel information fusion. A channel-wise masked reconstruction pre-training strategy is proposed, which first treats gait phase state vectors and IMU signals as joint observations of human locomotion, thus enhancing model generalization. Experimental results demonstrate that the proposed method outperforms existing baseline approaches, achieving a gait phase RMSE of $2.729 \pm 1.071\%$ and phase rate MAE of $0.037 \pm 0.016\%$ under stable terrain conditions with a look-back window of 2 seconds, and a phase RMSE of $3.215 \pm 1.303\%$ and rate MAE of $0.050 \pm 0.023\%$ under terrain transitions. Hardware validation on a hip exoskeleton further confirms that the algorithm can reliably identify gait cycles and key events, adapting to various continuous motion scenarios. This research paves the way for more intelligent and adaptive exoskeleton systems, enabling safer and more efficient human-robot interaction across diverse real-world environments.
Submitted 18 June, 2025;
originally announced June 2025.
-
Graph Neural Networks in Modern AI-aided Drug Discovery
Authors:
Odin Zhang,
Haitao Lin,
Xujun Zhang,
Xiaorui Wang,
Zhenxing Wu,
Qing Ye,
Weibo Zhao,
Jike Wang,
Kejun Ying,
Yu Kang,
Chang-yu Hsieh,
Tingjun Hou
Abstract:
Graph neural networks (GNNs), as topology/structure-aware models within deep learning, have emerged as powerful tools for AI-aided drug discovery (AIDD). By directly operating on molecular graphs, GNNs offer an intuitive and expressive framework for learning the complex topological and geometric features of drug-like molecules, cementing their role in modern molecular modeling. This review provides a comprehensive overview of the methodological foundations and representative applications of GNNs in drug discovery, spanning tasks such as molecular property prediction, virtual screening, molecular generation, biomedical knowledge graph construction, and synthesis planning. Particular attention is given to recent methodological advances, including geometric GNNs, interpretable models, uncertainty quantification, scalable graph architectures, and graph generative frameworks. We also discuss how these models integrate with modern deep learning approaches, such as self-supervised learning, multi-task learning, meta-learning and pre-training. Throughout this review, we highlight the practical challenges and methodological bottlenecks encountered when applying GNNs to real-world drug discovery pipelines, and conclude with a discussion on future directions.
Submitted 7 June, 2025;
originally announced June 2025.
-
Classification and enumeration of solid-solid phase transition mechanisms
Authors:
Fang-Cheng Wang,
Qi-Jun Ye,
Yu-Cheng Zhu,
Xin-Zheng Li
Abstract:
Crystal-structure match (CSM), the atom-to-atom correspondence between two crystalline phases, is used extensively to describe solid-solid phase transition (SSPT) mechanisms. However, existing computational methods cannot account for all possible CSMs. Here, we propose a formalism to classify all CSMs into a tree structure, which is independent of the choices of unit cell and supercell. We rigorously prove that only a finite number of noncongruent CSMs are of practical interest. By representing CSMs as integer matrices, we introduce the crystmatch method to exhaustively enumerate them, which uncontroversially solves the CSM optimization problem under any geometric criterion. For most SSPTs, crystmatch can reproduce all known deformation mechanisms and CSMs within 10 CPU minutes, while also revealing thousands of new candidates. The resulting database can be further used for comparing experimental phenomena, high-throughput energy barrier calculations, or machine learning.
Submitted 5 June, 2025;
originally announced June 2025.
-
STELLA: Towards Protein Function Prediction with Multimodal LLMs Integrating Sequence-Structure Representations
Authors:
Hongwang Xiao,
Wenjun Lin,
Xi Chen,
Hui Wang,
Kai Chen,
Jiashan Li,
Yuancheng Sun,
Sicheng Dai,
Boya Wu,
Qiwei Ye
Abstract:
Protein biology focuses on the intricate relationships among sequences, structures, and functions. Deciphering protein functions is crucial for understanding biological processes, advancing drug discovery, and enabling synthetic biology applications. Since protein sequences determine tertiary structures, which in turn govern functions, integrating sequence and structure information is essential for accurate prediction of protein functions. Traditional protein language models (pLMs) have advanced protein-related tasks by learning representations from large-scale sequence and structure data. However, pLMs are limited in integrating broader contextual knowledge, particularly regarding functional modalities that are fundamental to protein biology. In contrast, large language models (LLMs) have exhibited outstanding performance in contextual understanding, reasoning, and generation across diverse domains. Leveraging these capabilities, STELLA is proposed as a multimodal LLM integrating protein sequence-structure representations with general knowledge to address protein function prediction. Through multimodal instruction tuning (MMIT) using the proposed OPI-Struc dataset, STELLA achieves state-of-the-art performance in two function-related tasks: functional description prediction (FP) and enzyme-catalyzed reaction prediction (EP). This study highlights the potential of multimodal LLMs as an alternative paradigm to pLMs to advance protein biology research.
Submitted 4 June, 2025;
originally announced June 2025.
-
OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
Authors:
Mengkang Hu,
Yuhang Zhou,
Wendong Fan,
Yuzhou Nie,
Bowei Xia,
Tao Sun,
Ziyu Ye,
Zhaoxuan Jin,
Yingru Li,
Qiguang Chen,
Zeyu Zhang,
Yifeng Wang,
Qianshuo Ye,
Bernard Ghanem,
Ping Luo,
Guohao Li
Abstract:
Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases. During inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents. For training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.
Submitted 10 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Transparent and heat-insulation bionic hydrogel-based smart window system for long-term cooling and waste heat collection
Authors:
Qianwang Ye,
Hanqing Dai,
Yukun Yan,
Liwei Wang,
Xinlin Du,
Yimeng Wang,
Zhile Han,
Wanlu Zhang,
Ruiqian Guo
Abstract:
With the energy crisis and climate warming, a new generation of smart windows is becoming increasingly important, and materials or systems that strongly block near-infrared (NIR) and ultraviolet (UV) light while transmitting visible light (VIS) are needed. Currently, it is difficult for smart heat-insulation materials to achieve high transmittance of VIS, good UV isolation, outstanding cooling and thermal insulation, and excellent waste heat collection. Here, we design a novel composite hydrogel that achieves an average VIS transmittance of 92%, efficient UV absorption, 11 °C of thermal insulation, and sensing properties. We also design a transparent heat-insulation system based on this composite hydrogel that achieves a record-breaking insulation performance of about 22 °C for 168 hours, together with waste heat collection and reutilization and temperature sensing. Our findings provide new ideas and possibilities for designing transparent and heat-insulation smart window systems.
Submitted 29 May, 2025;
originally announced May 2025.
-
Determination of melting temperature of hexagonal ice using Lee-Yang phase transition theory
Authors:
Ling Liu,
Yihua Dong,
Qijun Ye,
Xin-Zheng Li
Abstract:
Lee-Yang phase transition theory is a milestone in statistical physics. Its application to realistic systems, however, has been substantially hindered by the lack of practical schemes to calculate the Lee-Yang zeros. In this manuscript, we extend the scheme we designed earlier [Phys. Rev. E 109, 024118 (2024)] and report simulation results for the melting temperature (T) of ice Ih under ambient pressure. The enhanced sampling technique is shown to be crucial for accessing Lee-Yang zeros accurately. The real and imaginary parts of our Lee-Yang edges demonstrate linear scaling of sizes, which can lead to a melting T of 248.15 K for the TIP4P/2005 potential in the thermodynamic limit. This result is in close quantitative agreement with previous coexistence simulations, and is achieved at lower computational cost and without prior knowledge of the phase transition. With these results, we demonstrate the applicability of Lee-Yang phase transition theory to realistic molecular systems and provide a feasible scheme for high-throughput calculations of phase transition temperatures.
Submitted 27 May, 2025;
originally announced May 2025.
-
ReDDiT: Rehashing Noise for Discrete Visual Generation
Authors:
Tianren Ma,
Xiaosong Zhang,
Boyu Yang,
Junlan Feng,
Qixiang Ye
Abstract:
Discrete diffusion models are gaining traction in the visual generative area for their efficiency and compatibility. However, the pioneering attempts still fall behind their continuous counterparts, which we attribute to the noise (absorbing state) design and sampling heuristics. In this study, we propose the rehashing noise framework for discrete diffusion transformers, termed ReDDiT, to extend absorbing states and improve the expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables can traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees the diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline (reducing gFID from 6.18 to 1.61) and is on par with the continuous counterparts with higher efficiency.
Submitted 29 May, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model
Authors:
Zhenhao Zhang,
Ye Shi,
Lingxiao Yang,
Suting Ni,
Qi Ye,
Jingya Wang
Abstract:
Understanding and synthesizing realistic 3D hand-object interactions (HOI) is critical for applications ranging from immersive AR/VR to dexterous robotics. Existing methods struggle with generalization, performing well on closed-set objects and predefined tasks but failing to handle unseen objects or open-vocabulary instructions. We introduce OpenHOI, the first framework for open-world HOI synthesis, capable of generating long-horizon manipulation sequences for novel objects guided by free-form language commands. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition, enabling precise localization of interaction regions (e.g., handles, buttons) and breakdown of complex instructions (e.g., "Find a water bottle and take a sip") into executable sub-tasks. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage that minimizes penetration and optimizes affordance alignment. Evaluations across diverse scenarios demonstrate OpenHOI's superiority over state-of-the-art methods in generalizing to novel object categories, multi-stage tasks, and complex language instructions. Our project page is at https://openhoi.github.io
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Authors:
Xiaoyu Xu,
Xiang Yue,
Yang Liu,
Qingqing Ye,
Haibo Hu,
Minxin Du
Abstract:
Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.
Submitted 22 May, 2025;
originally announced May 2025.
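The abstract above lists centered kernel alignment (CKA) among its representation-level diagnostics. As a minimal sketch, assuming the common linear-kernel variant of CKA and NumPy arrays of activations with one row per input, the similarity between two sets of representations could be computed as below. The function name `linear_cka` is hypothetical and is not taken from the authors' toolkit.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, n_features).

    Hypothetical helper for illustration; the two matrices must share the
    same number of rows (the same inputs), but may differ in feature width.
    """
    # Center every feature column so the comparison ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

A value near 1 for activations taken before and after unlearning would be consistent with the "reversible" case described in the abstract, where latent features are retained; how the paper thresholds or combines this score with its other metrics is not stated here.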
-
RefiDiff: Refinement-Aware Diffusion for Efficient Missing Data Imputation
Authors:
Md Atik Ahamed,
Qiang Ye,
Qiang Cheng
Abstract:
Missing values in high-dimensional, mixed-type datasets pose significant challenges for data imputation, particularly under Missing Not At Random (MNAR) mechanisms. Existing methods struggle to integrate local and global data characteristics, limiting performance in MNAR and high-dimensional settings. We propose an innovative framework, RefiDiff, combining local machine learning predictions with a novel Mamba-based denoising network capturing interrelationships among distant features and samples. Our approach leverages pre-refinement for initial warm-up imputations and post-refinement to polish results, enhancing stability and accuracy. By encoding mixed-type data into unified tokens, RefiDiff enables robust imputation without architectural or hyperparameter tuning. RefiDiff outperforms state-of-the-art (SOTA) methods across missing-value settings, excelling in MNAR with a 4x faster training time than SOTA DDPM-based approaches. Extensive evaluations on nine real-world datasets demonstrate its robustness, scalability, and effectiveness in handling complex missingness patterns.
Submitted 20 May, 2025;
originally announced May 2025.
-
Joint stochastic localization and applications
Authors:
Tom Alberts,
Yiming Xu,
Qiang Ye
Abstract:
Stochastic localization is a pathwise analysis technique originating from convex geometry. This paper explores certain algorithmic aspects of stochastic localization as a computational tool. First, we unify various existing stochastic localization schemes and discuss their localization rates and regularization. We then introduce a joint stochastic localization framework for constructing couplings between probability distributions. As an initial application, we extend the optimal couplings between normal distributions under the 2-Wasserstein distance to log-concave distributions and derive a normal approximation result. As a further application, we introduce a family of distributional distances based on the couplings induced by joint stochastic localization. Under a specific choice of the localization process, the induced distance is topologically equivalent to the 2-Wasserstein distance for probability measures supported on a common compact set. Moreover, weighted versions of this distance are related to several statistical divergences commonly used in practice. The proposed distances also motivate new methods for distribution estimation that are of independent interest.
Submitted 19 May, 2025;
originally announced May 2025.
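The abstract above starts from the optimal coupling between normal distributions under the 2-Wasserstein distance. As a small illustration of that baseline only (not of joint stochastic localization itself), the one-dimensional Gaussian case has a closed form: the distance is the Euclidean distance between the (mean, standard deviation) pairs, and the optimal coupling is the increasing affine map. The function names below are hypothetical.

```python
import math


def w2_gaussian_1d(m1: float, s1: float, m2: float, s2: float) -> float:
    """2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)


def optimal_map_1d(x: float, m1: float, s1: float, m2: float, s2: float) -> float:
    """Monotone affine map pushing N(m1, s1^2) onto N(m2, s2^2).

    Requires s1 > 0. Pairing each x with optimal_map_1d(x, ...) gives the
    optimal coupling for the squared-distance cost in one dimension.
    """
    return m2 + (s2 / s1) * (x - m1)
```

For example, `w2_gaussian_1d(0.0, 1.0, 3.0, 5.0)` evaluates the distance between N(0, 1) and N(3, 25).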
-
Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks?
Authors:
Zi Liang,
Haibo Hu,
Qingqing Ye,
Yaxin Xiao,
Ronghua Li
Abstract:
Low rank adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) thanks to its superb efficiency gains over previous methods. While extensive studies have examined the performance and structural properties of LoRA, its behavior under training-time attacks remains underexplored, posing significant security risks. In this paper, we theoretically investigate the security implications of LoRA's low-rank structure during fine-tuning, in the context of its robustness against data poisoning and backdoor attacks. We propose an analytical framework that models LoRA's training dynamics, employs the neural tangent kernel to simplify the analysis of the training process, and applies information theory to establish connections between LoRA's low rank structure and its vulnerability to training-time attacks. Our analysis indicates that LoRA exhibits better robustness to backdoor attacks than full fine-tuning, while becoming more vulnerable to untargeted data poisoning due to its over-simplified information geometry. Extensive experimental evaluations have corroborated our theoretical findings.
Submitted 19 May, 2025;
originally announced May 2025.
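For readers unfamiliar with the low-rank structure the abstract above analyzes, here is a minimal NumPy sketch of a LoRA-style linear layer, assuming the standard formulation in which a frozen weight `w_frozen` is augmented by a trainable rank-`r` product `b @ a` scaled by `alpha / r`. The function name and shapes are illustrative assumptions, and nothing here reproduces the paper's analytical framework.

```python
import numpy as np


def lora_forward(x, w_frozen, a, b, alpha):
    """Apply a linear layer whose weight is w_frozen + (alpha / r) * b @ a.

    Illustrative shapes (assumptions):
      x:        (batch, d_in)
      w_frozen: (d_out, d_in)  pretrained weight, not updated
      a:        (r, d_in)      trainable, usually random at initialization
      b:        (d_out, r)     trainable, usually zero at initialization
    """
    r = a.shape[0]
    delta = (alpha / r) * (b @ a)  # update has rank at most r
    return x @ (w_frozen + delta).T


rng = np.random.default_rng(0)
d_in, d_out, r = 16, 12, 4
w = rng.normal(size=(d_out, d_in))
a = rng.normal(size=(r, d_in))
b = np.zeros((d_out, r))  # zero init: the adapted layer starts equal to the base layer
x = rng.normal(size=(3, d_in))
y = lora_forward(x, w, a, b, alpha=8.0)
```

Only `a` and `b` are trained, so the weight update is confined to a subspace of rank at most `r`; this restriction is the structural property whose security implications the paper studies.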
-
Neuro-Symbolic Query Compiler
Authors:
Yuyao Zhang,
Zhicheng Dou,
Xiaoxi Li,
Jiajie Jin,
Yongkang Wu,
Zhonghua Li,
Qi Ye,
Ji-Rong Wen
Abstract:
Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents QCompiler, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically designs a minimal yet sufficient Backus-Naur Form (BNF) grammar $G[q]$ to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system's ability to address complex queries.
Submitted 17 May, 2025;
originally announced May 2025.
-
Hierarchical Document Refinement for Long-context Retrieval-augmented Generation
Authors:
Jiajie Jin,
Xiaoxi Li,
Guanting Dong,
Yuyao Zhang,
Yutao Zhu,
Yongkang Wu,
Zhonghua Li,
Qi Ye,
Zhicheng Dou
Abstract:
Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise result in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while incurring 10x lower computational cost and latency than the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.
Submitted 15 May, 2025;
originally announced May 2025.
-
Demonstration of low-overhead quantum error correction codes
Authors:
Ke Wang,
Zhide Lu,
Chuanyu Zhang,
Gongyu Liu,
Jiachen Chen,
Yanzhe Wang,
Yaozu Wu,
Shibo Xu,
Xuhao Zhu,
Feitong Jin,
Yu Gao,
Ziqi Tan,
Zhengyi Cui,
Ning Wang,
Yiren Zou,
Aosai Zhang,
Tingting Li,
Fanhao Shen,
Jiarun Zhong,
Zehang Bao,
Zitian Zhu,
Yihang Han,
Yiyang He,
Jiayuan Shen,
Han Wang, et al. (17 additional authors not shown)
Abstract:
Quantum computers hold the potential to surpass classical computers in solving complex computational problems. However, the fragility of quantum information and the error-prone nature of quantum operations make building large-scale, fault-tolerant quantum computers a prominent challenge. To combat errors, pioneering experiments have demonstrated a variety of quantum error correction codes. Yet, most of these codes suffer from low encoding efficiency, and their scalability is hindered by prohibitively high resource overheads. Here, we report the demonstration of two low-overhead quantum low-density parity-check (qLDPC) codes, a distance-4 bivariate bicycle code and a distance-3 qLDPC code, on our latest superconducting processor, Kunlun, featuring 32 long-range-coupled transmon qubits. Utilizing a two-dimensional architecture with overlapping long-range couplers, we demonstrate simultaneous measurements of all nonlocal weight-6 stabilizers via the periodic execution of an efficient syndrome extraction circuit. We achieve a logical error rate per logical qubit per cycle of $(8.91 \pm 0.17)\%$ for the distance-4 bivariate bicycle code with four logical qubits and $(7.77 \pm 0.12)\%$ for the distance-3 qLDPC code with six logical qubits. Our results establish the feasibility of implementing various qLDPC codes with long-range coupled superconducting processors, marking a crucial step towards large-scale low-overhead quantum error correction.
Submitted 14 May, 2025;
originally announced May 2025.
-
Seed1.5-VL Technical Report
Authors:
Dong Guo,
Faming Wu,
Feida Zhu,
Fuxing Leng,
Guang Shi,
Haobin Chen,
Haoqi Fan,
Jian Wang,
Jianyu Jiang,
Jiawei Wang,
Jingji Chen,
Jingjia Huang,
Kang Lei,
Liping Yuan,
Lishu Luo,
Pengfei Liu,
Qinghao Ye,
Rui Qian,
Shen Yan,
Shixiong Zhao,
Shuai Peng,
Shuangye Li,
Sihang Yuan,
Sijin Wu,
Tianheng Cheng, et al. (172 additional authors not shown)
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).
Submitted 11 May, 2025;
originally announced May 2025.
-
OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models
Authors:
Xiaoyu Xu,
Minxin Du,
Qingqing Ye,
Haibo Hu
Abstract:
Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components -- masking, distillation, and world fact. Using low-rank adapters (LoRA), it ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.
Submitted 7 May, 2025;
originally announced May 2025.
-
Compact Recurrent Transformer with Persistent Memory
Authors:
Edison Mucllari,
Zachary Daniels,
David Zhang,
Qiang Ye
Abstract:
The Transformer architecture has shown significant success in many language processing and visual tasks. However, the method faces challenges in efficiently scaling to long sequences because the self-attention computation is quadratic with respect to the input length. To overcome this limitation, several approaches scale to longer sequences by breaking long sequences into a series of segments, restricting self-attention to local dependencies between tokens within each segment and using a memory mechanism to manage information flow between segments. However, these approaches generally introduce additional compute overhead that restricts them from being used for applications where limited compute memory and power are of great concern (such as edge computing). We propose a novel and efficient Compact Recurrent Transformer (CRT), which combines shallow Transformer models that process short local segments with recurrent neural networks to compress and manage a single persistent memory vector that summarizes long-range global information between segments. We evaluate CRT on WordPTB and WikiText-103 for next-token-prediction tasks, as well as on the Toyota Smarthome video dataset for classification. CRT achieves comparable or superior prediction results to full-length Transformers on the language datasets while using significantly shorter segments (half or quarter size) and substantially reduced FLOPs. Our approach also demonstrates state-of-the-art performance on the Toyota Smarthome video dataset.
Submitted 1 May, 2025;
originally announced May 2025.
-
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
Authors:
Peilin Zhou,
Bruce Leon,
Xiang Ying,
Can Zhang,
Yifan Shao,
Qichen Ye,
Dading Chong,
Zhiling Jin,
Chenxuan Xie,
Meng Cao,
Yuxin Gu,
Sixin Hong,
Jing Ren,
Jian Chen,
Chao Liu,
Yining Hua
Abstract:
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
Submitted 1 May, 2025; v1 submitted 27 April, 2025;
originally announced April 2025.
-
From Randomized Response to Randomized Index: Answering Subset Counting Queries with Local Differential Privacy
Authors:
Qingqing Ye,
Liantong Yu,
Kai Huang,
Xiaokui Xiao,
Weiran Liu,
Haibo Hu
Abstract:
Local Differential Privacy (LDP) is the predominant privacy model for safeguarding individual data privacy. Existing perturbation mechanisms typically require perturbing the original values to ensure acceptable privacy, which inevitably results in value distortion and utility deterioration. In this work, we propose an alternative approach -- instead of perturbing values, we apply randomization to indexes of values while ensuring rigorous LDP guarantees. Inspired by the deniability of randomized indexes, we present CRIAD for answering subset counting queries on set-value data. By integrating a multi-dummy, multi-sample, and multi-group strategy, CRIAD serves as a fully scalable solution that offers flexibility across various privacy requirements and domain sizes, and achieves more accurate query results than any existing methods. Through comprehensive theoretical analysis and extensive experimental evaluations, we validate the effectiveness of CRIAD and demonstrate its superiority over traditional value-perturbation mechanisms.
Submitted 24 April, 2025;
originally announced April 2025.
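The abstract above contrasts its index-randomization approach with mechanisms that perturb the values themselves. For context, here is a minimal sketch of the classic value-perturbation baseline, binary randomized response, with its unbiased count estimator. This is not CRIAD; the function names are hypothetical, and the sketch assumes each user holds a single bit.

```python
import numpy as np


def randomized_response(bits, epsilon, rng):
    """Report each bit truthfully with probability e^eps / (1 + e^eps), else flip it.

    The ratio between the two reporting probabilities is e^eps, so each
    individual report satisfies eps-local differential privacy.
    """
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    keep = rng.random(bits.shape) < p_keep
    return np.where(keep, bits, 1 - bits)


def estimate_count(reports, epsilon):
    """Unbiased estimate of how many users truly hold a 1.

    E[sum(reports)] = c * p + (n - c) * (1 - p), which is solved for c.
    """
    n = reports.size
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (reports.sum() - n * (1.0 - p)) / (2.0 * p - 1.0)


rng = np.random.default_rng(0)
true_bits = np.zeros(200_000, dtype=int)
true_bits[:60_000] = 1  # 60,000 users hold the value
reports = randomized_response(true_bits, epsilon=1.0, rng=rng)
estimate = estimate_count(reports, epsilon=1.0)
```

Every reported value here may be distorted, and the estimator's variance grows as epsilon shrinks; this is the utility loss the abstract attributes to value perturbation.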
-
The resonance parameters of the vector charmonium-like state $G(3900)$
Authors:
Quanxing Ye,
Zhenyu Zhang,
Meng-Lin Du,
Ulf-G. Meißner,
Peng-Yu Niu,
Qian Wang
Abstract:
Motivated by the updated analysis of the $G(3900)$ by the BESIII collaboration, we perform a global analysis of the cross sections of the $e^+e^-\to D\bar{D}$, $e^+e^-\to D\bar{D}^*+c.c.$, $e^+e^-\to D^*\bar{D}^*$ processes, especially focusing on the properties of the $G(3900)$. As the energy region of interest is limited by the next opening threshold, i.e. the $D_1\bar{D}$ threshold, we focus on the energy region $[3.7,4.25]~\mathrm{GeV}$, where three charmonia $ψ(1D)$, $ψ(3S)$ and $ψ(2D)$ explicitly contribute to the cross sections. By constructing the $P$-wave contact interaction between the $(D,D^*)$ doublet and its antiparticle in the heavy quark limit, we extract the physical scattering amplitude by solving the Lippmann-Schwinger equation. No matter whether three or two charmonium states are included in our framework, we always find a dynamically generated state corresponding to the $G(3900)$, which suggests it to be a $P$-wave dynamically generated state. We also predict several dynamically generated states in the corresponding $1^{-+}$ channel. These states can be further searched for in the electron-positron annihilation process involving the emission of a single photon.
Submitted 24 April, 2025;
originally announced April 2025.
-
Dual Utilization of Perturbation for Stream Data Publication under Local Differential Privacy
Authors:
Rong Du,
Qingqing Ye,
Yaxin Xiao,
Liantong Yu,
Yue Fu,
Haibo Hu
Abstract:
Stream data from real-time distributed systems such as IoT, tele-health, and crowdsourcing has become an important data source. However, the collection and analysis of user-generated stream data raise privacy concerns due to the potential exposure of sensitive information. To address these concerns, local differential privacy (LDP) has emerged as a promising standard. Nevertheless, applying LDP to stream data presents significant challenges, as stream data often involves a large or even infinite number of values. Allocating a given privacy budget across these data points would introduce overwhelming LDP noise to the original stream data.
Beyond existing approaches that merely use perturbed values for estimating statistics, our design leverages them for both perturbation and estimation. This dual utilization arises from a key observation: each user knows their own ground truth and perturbed values, enabling a precise computation of the deviation error caused by perturbation. By incorporating this deviation into the perturbation process of subsequent values, the previous noise can be calibrated. Following this insight, we introduce the Iterative Perturbation Parameterization (IPP) method, which utilizes current perturbed results to calibrate the subsequent perturbation process. To enhance the robustness of calibration and reduce sensitivity, two algorithms, namely Accumulated Perturbation Parameterization (APP) and Clipped Accumulated Perturbation Parameterization (CAPP) are further developed. We prove that these three algorithms satisfy $w$-event differential privacy while significantly improving utility. Experimental results demonstrate that our techniques outperform state-of-the-art LDP stream publishing solutions in terms of utility, while retaining the same privacy guarantee.
Submitted 21 April, 2025;
originally announced April 2025.
-
Multi-class Item Mining under Local Differential Privacy
Authors:
Yulian Mao,
Qingqing Ye,
Rong Du,
Qi Wang,
Kai Huang,
Haibo Hu
Abstract:
Item mining, a fundamental task for collecting statistical data from users, has raised increasing privacy concerns. To address these concerns, local differential privacy (LDP) was proposed as a privacy-preserving technique. Existing LDP item mining mechanisms primarily concentrate on global statistics, i.e., those from the entire dataset. Nevertheless, they fall short of user-tailored tasks such as personalized recommendations, whereas classwise statistics can improve task accuracy with fine-grained information. Meanwhile, the introduction of class labels brings new challenges. Label perturbation may result in invalid items for aggregation. To this end, we propose frameworks for multi-class item mining, along with two mechanisms: validity perturbation to reduce the impact of invalid data, and correlated perturbation to preserve the relationship between labels and items. We also apply these optimized methods to two multi-class item mining queries: frequency estimation and top-$k$ item mining. Through theoretical analysis and extensive experiments, we verify the effectiveness and superiority of these methods.
Submitted 18 April, 2025;
originally announced April 2025.
-
Deep learning to improve the discovery of near-Earth asteroids in the Zwicky Transient Facility
Authors:
Belén Yu Irureta-Goyena,
George Helou,
Jean-Paul Kneib,
Frank Masci,
Thomas Prince,
Kumar Venkataramani,
Quanzhi Ye,
Joseph Masiero,
Frédéric Dux,
Mathieu Salzmann
Abstract:
We present a novel pipeline that uses a convolutional neural network (CNN) to improve the detection capability of near-Earth asteroids (NEAs) in the context of planetary defense. Our work aims to minimize the dependency on human intervention of the current approach adopted by the Zwicky Transient Facility (ZTF). The target NEAs have a high proper motion of up to tens of degrees per day and thus appear as streaks of light in the images. We trained our CNNs to detect these streaks using three datasets: a set with real asteroid streaks, a set with synthetic (i.e., simulated) streaks and a mixed set, and tested the resultant models on real survey images. The results achieved were almost identical across the three models: $0.843\pm0.005$ in completeness and $0.820\pm0.025$ in precision. The bias on streak measurements reported by the CNNs was $1.84\pm0.03$ pixels in streak position, $0.817\pm0.026$ degrees in streak angle and $-0.048\pm0.003$ in fractional bias in streak length (computed as the absolute length bias over the streak length, with the negative sign indicating an underestimation). We compared the performance of our CNN trained with a mix of synthetic and real streaks to that of the ZTF human scanners by analyzing a set of 317 streaks flagged as valid by the scanners. Our pipeline detected $80~\%$ of the streaks found by the scanners and 697 additional streaks that were subsequently verified by the scanners to be valid streaks. These results suggest that our automated pipeline can complement the work of the human scanners at no cost to precision and find more objects than the current approach. They also show that the synthetic streaks were realistic enough to be used for augmenting training sets when insufficient real streaks are available or for exploring the simulation of streaks with unusual characteristics that have not yet been detected.
Submitted 30 May, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
-
GPS: Distilling Compact Memories via Grid-based Patch Sampling for Efficient Online Class-Incremental Learning
Authors:
Mingchuan Ma,
Yuhao Zhou,
Jindi Lv,
Yuxin Tian,
Dan Si,
Shujian Li,
Qing Ye,
Jiancheng Lv
Abstract:
Online class-incremental learning aims to enable models to continuously adapt to new classes with limited access to past data, while mitigating catastrophic forgetting. Replay-based methods address this by maintaining a small memory buffer of previous samples, achieving competitive performance. For effective replay under constrained storage, recent approaches leverage distilled data to enhance the informativeness of memory. However, such approaches often involve significant computational overhead due to the use of bi-level optimization. Motivated by these limitations, we introduce Grid-based Patch Sampling (GPS), a lightweight and effective strategy for distilling informative memory samples without relying on a trainable model. GPS generates informative samples by sampling a subset of pixels from the original image, yielding compact low-resolution representations that preserve both semantic content and structural information. During replay, these representations are reassembled to support training and evaluation. Experiments on extensive benchmarks demonstrate that GPS can be seamlessly integrated into existing replay frameworks, leading to 3%-4% improvements in average end accuracy under memory-constrained settings, with limited computational overhead.
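The sampling-and-reassembly idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single shared random offset per image, and the mosaic-style reassembly are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def grid_patch_sample(image, stride, rng):
    """Keep one pixel per stride x stride grid cell (hypothetical helper).

    A single random offset (dy, dx) is shared by all cells, so the result
    is a low-resolution copy of shape (H // stride, W // stride, C) when
    H and W are divisible by stride.
    """
    dy, dx = rng.integers(0, stride, size=2)
    return image[dy::stride, dx::stride]

def reassemble(samples, stride):
    """Tile stride * stride low-resolution samples into one full-size mosaic.

    This is one assumed way to 'reassemble' compact samples for replay.
    """
    rows = [
        np.concatenate(samples[r * stride:(r + 1) * stride], axis=1)
        for r in range(stride)
    ]
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
images = [rng.random((32, 32, 3)) for _ in range(4)]
compact = [grid_patch_sample(img, stride=2, rng=rng) for img in images]
mosaic = reassemble(compact, stride=2)
```

With a stride of 2, each stored sample uses a quarter of the original pixels, so four compact samples occupy the memory of one full image.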
Submitted 14 April, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
On walk domination: Between different types of walks and $m_3$-path
Authors:
Hangdi Chen,
Yuhan Ma,
Qingjie Ye
Abstract:
This paper investigates the domination relationships among various types of walks connecting two non-adjacent vertices in a graph. In particular, we center our attention on the problem which is proposed in [S. B. Tondato, Graphs Combin. 40 (2024)]. A \textit{\( uv \)-\( m_3 \) path} is a \( uv \)-induced path of length at least three. A walk between two non-adjacent vertices in a graph $G$ is called a weakly toll walk if the first and last vertices in the walk are adjacent only to the second and second-to-last vertices, respectively, and these intermediate vertices may appear more than once in the walk. And an $l_k$-path is an induced path of length at most $k$ between two non-adjacent vertices in a graph $G$. We study the domination between weakly toll walks, $l_k$-paths ($k\in \left\{2,3\right\}$) and different types of walks connecting two non-adjacent vertices $u$ and $v$ of a graph (shortest paths, tolled walks, weakly toll walks, $l_k$-paths for $k\in \left\{2,3\right\}$, $m_3$-path), and show how these give rise to characterizations of graph classes.
Submitted 8 April, 2025;
originally announced April 2025.
-
Technical Report: Full Version of Analyzing and Optimizing Perturbation of DP-SGD Geometrically
Authors:
Jiawei Duan,
Haibo Hu,
Qingqing Ye,
Xinyue Sun
Abstract:
Differential privacy (DP) has become a prevalent privacy model in a wide range of machine learning tasks, especially after the debut of DP-SGD. However, DP-SGD, which directly perturbs gradients in the training iterations, fails to mitigate the negative impacts of noise on gradient direction. As a result, DP-SGD is often inefficient. Although various solutions (e.g., clipping to reduce the sensitivity of gradients and amplifying privacy bounds to save privacy budgets) have been proposed to trade privacy for model efficiency, the root cause of its inefficiency has yet to be unveiled.
In this work, we first generalize DP-SGD and theoretically derive the impact of DP noise on the training process. Our analysis reveals that, in terms of a perturbed gradient, only the noise on direction has eminent impact on the model efficiency while that on magnitude can be mitigated by optimization techniques, i.e., fine-tuning gradient clipping and learning rate. Besides, we confirm that traditional DP introduces biased noise on the direction when adding unbiased noise to the gradient itself. Overall, the perturbation of DP-SGD is actually sub-optimal from a geometric perspective. Motivated by this, we design a geometric perturbation strategy GeoDP within the DP framework, which perturbs the direction and the magnitude of a gradient, respectively. By directly reducing the noise on the direction, GeoDP mitigates the negative impact of DP noise on model efficiency with the same DP guarantee. Extensive experiments on two public datasets (i.e., MNIST and CIFAR-10), one synthetic dataset and three prevalent models (i.e., Logistic Regression, CNN and ResNet) confirm the effectiveness and generality of our strategy.
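For context, the baseline perturbation that this abstract analyzes can be sketched as follows. This is the standard DP-SGD clip-and-noise step, not GeoDP; the function and parameter names are illustrative assumptions, and no privacy accounting is included.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Standard DP-SGD step: clip each example's gradient, sum, add noise, average.

    per_example_grads has shape (batch, dim). Gaussian noise with standard
    deviation noise_multiplier * clip_norm is added to the clipped sum.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down only those gradients whose L2 norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.0, 1.0]])
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Because the noise is added coordinate-wise to the gradient vector, it changes both the direction and the magnitude of the update, which is the effect the abstract separates out.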
Submitted 7 April, 2025;
originally announced April 2025.
-
A Robust Model-Based Approach for Continuous-Time Policy Evaluation with Unknown Lévy Process Dynamics
Authors:
Qihao Ye,
Xiaochuan Tian,
Yuhua Zhu
Abstract:
This paper develops a model-based framework for continuous-time policy evaluation (CTPE) in reinforcement learning, incorporating both Brownian and Lévy noise to model stochastic dynamics influenced by rare and extreme events. Our approach formulates the policy evaluation problem as solving a partial integro-differential equation (PIDE) for the value function with unknown coefficients. A key challenge in this setting is accurately recovering the unknown coefficients in the stochastic dynamics, particularly when driven by Lévy processes with heavy tail effects. To address this, we propose a robust numerical approach that effectively handles both unbiased and censored trajectory datasets. This method combines maximum likelihood estimation with an iterative tail correction mechanism, improving the stability and accuracy of coefficient recovery. Additionally, we establish a theoretical bound for the policy evaluation error based on coefficient recovery error. Through numerical experiments, we demonstrate the effectiveness and robustness of our method in recovering heavy-tailed Lévy dynamics and verify the theoretical error analysis in policy evaluation.
Submitted 24 April, 2025; v1 submitted 2 April, 2025;
originally announced April 2025.
-
AdvSGM: Differentially Private Graph Learning via Adversarial Skip-gram Model
Authors:
Sen Zhang,
Qingqing Ye,
Haibo Hu,
Jianliang Xu
Abstract:
The skip-gram model (SGM), which employs a neural network to generate node vectors, serves as the basis for numerous popular graph embedding techniques. However, since the training datasets contain sensitive linkage information, the parameters of a released SGM may encode private information and pose significant privacy risks. Differential privacy (DP) is a rigorous standard for protecting individual privacy in data analysis. Nevertheless, when applying differential privacy to skip-gram in graphs, it becomes highly challenging due to the complex link relationships, which potentially result in high sensitivity and necessitate substantial noise injection. To tackle this challenge, we present AdvSGM, a differentially private skip-gram for graphs via adversarial training. Our core idea is to leverage adversarial training to privatize skip-gram while improving its utility. Towards this end, we develop a novel adversarial training module by devising two optimizable noise terms that correspond to the parameters of a skip-gram. By fine-tuning the weights between modules within AdvSGM, we can achieve differentially private gradient updates without additional noise injection. Extensive experimental results on six real-world graph datasets show that AdvSGM preserves high data utility across different downstream tasks.
Submitted 27 March, 2025;
originally announced March 2025.
-
Delving Deep into Semantic Relation Distillation
Authors:
Zhaoyi Yan,
Kangjun Liu,
Qixiang Ye
Abstract:
Knowledge distillation has become a cornerstone technique in deep learning, facilitating the transfer of knowledge from complex models to lightweight counterparts. Traditional distillation approaches focus on transferring knowledge at the instance level, but fail to capture nuanced semantic relationships within the data. In response, this paper introduces a novel methodology, Semantics-based Relation Knowledge Distillation (SeRKD), which reimagines knowledge distillation through a semantics-relation lens among each sample. By leveraging semantic components, i.e., superpixels, SeRKD enables a more comprehensive and context-aware transfer of knowledge, which skillfully integrates superpixel-based semantic extraction with relation-based knowledge distillation for a sophisticated model compression and distillation. Particularly, the proposed method is naturally relevant in the domain of Vision Transformers (ViTs), where visual tokens serve as fundamental units of representation. Experimental evaluations on benchmark datasets demonstrate the superiority of SeRKD over existing methods, underscoring its efficacy in enhancing model performance and generalization capabilities.
Submitted 27 March, 2025;
originally announced March 2025.
-
Ferret: An Efficient Online Continual Learning Framework under Varying Memory Constraints
Authors:
Yuhao Zhou,
Yuxin Tian,
Jindi Lv,
Mingjia Shi,
Yuanxi Li,
Qing Ye,
Shuhao Zhang,
Jiancheng Lv
Abstract:
In the realm of high-frequency data streams, achieving real-time learning within varying memory constraints is paramount. This paper presents Ferret, a comprehensive framework designed to enhance online accuracy of Online Continual Learning (OCL) algorithms while dynamically adapting to varying memory budgets. Ferret employs a fine-grained pipeline parallelism strategy combined with an iterative gradient compensation algorithm, ensuring seamless handling of high-frequency data with minimal latency, and effectively counteracting the challenge of stale gradients in parallel training. To adapt to varying memory budgets, its automated model partitioning and pipeline planning optimizes performance regardless of memory limitations. Extensive experiments across 20 benchmarks and 5 integrated OCL algorithms show Ferret's remarkable efficiency, achieving up to 3.7$\times$ lower memory overhead to reach the same online accuracy compared to competing methods. Furthermore, Ferret consistently outperforms these methods across diverse memory budgets, underscoring its superior adaptability. These findings position Ferret as a premier solution for efficient and adaptive OCL framework in real-time environments.
Submitted 15 March, 2025;
originally announced March 2025.
-
In Search of the Potentially Hazardous Asteroids in the Taurid Resonant Swarm
Authors:
Jasmine Li,
Quanzhi Ye,
Denis Vida,
David L. Clark,
Eric C. Bellm,
Richard Dekany,
Matthew J. Graham,
Frank J. Masci,
Josiah Purdum,
Benjamin Racine,
Avery Wold
Abstract:
The Taurid Complex is a large interplanetary system that contains comet 2P/Encke, several meteoroid streams, and possibly a number of near-Earth asteroids. The size and nature of the system have led to the speculation that it was formed through a large-scale cometary breakup. Numerical investigations have suggested that planetary dynamics can create a resonant region with a large number of objects concentrated in a small segment of the orbit, known as the Taurid swarm, which approaches the Earth in certain years and provides favorable conditions to study the Taurid Complex. Recent meteor observations confirmed the existence of the swarm for mm- to m-sized objects. Here we present a dedicated telescopic search for potentially hazardous asteroids and other macroscopic objects in the Taurid swarm using the Zwicky Transient Facility survey. We determine from our non-detection that there are no more than 9--14 $H\leq24$ (equivalent to a diameter of $D\gtrsim100$~m) objects in the swarm, suggesting that the Encke--Taurid progenitor was $\sim10$~km in size. A progenitor of such a size is compatible with the prediction of state-of-the-art Solar System dynamical models, which expect $\sim0.1$ $D>10$~km objects on Encke-like orbits at any given time.
Submitted 11 March, 2025;
originally announced March 2025.
-
Privacy for Free: Leveraging Local Differential Privacy Perturbed Data from Multiple Services
Authors:
Rong Du,
Qingqing Ye,
Yue Fu,
Haibo Hu
Abstract:
Local Differential Privacy (LDP) has emerged as a widely adopted privacy-preserving technique in modern data analytics, enabling users to share statistical insights while maintaining robust privacy guarantees. However, current LDP applications assume a single service gathering perturbed information from users. In reality, multiple services may be interested in collecting users' data, which poses privacy burdens to users as more such services emerge. To address this issue, this paper proposes a framework for collecting and aggregating data based on perturbed information from multiple services, regardless of their estimated statistics (e.g., mean or distribution) and perturbation mechanisms.
Then for mean estimation, we introduce the Unbiased Averaging (UA) method and its optimized version, User-level Weighted Averaging (UWA). The former utilizes biased perturbed data, while the latter assigns weights to different perturbed results based on perturbation information, thereby achieving minimal variance. For distribution estimation, we propose the User-level Likelihood Estimation (ULE), which treats all perturbed results from a user as a whole for maximum likelihood estimation. Experimental results demonstrate that our framework and constituting methods significantly improve the accuracy of both mean and distribution estimation.
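The minimum-variance idea behind weighting perturbed results can be illustrated with classical inverse-variance weighting. This is a textbook sketch under the assumption of independent, unbiased estimates with known variances; it is not the paper's UWA method, whose exact weighting the abstract does not give, and the function name is hypothetical.

```python
def inverse_variance_mean(estimates, variances):
    """Combine independent unbiased estimates with weights proportional to 1 / variance.

    Returns the combined estimate and its variance. Among all unbiased
    linear combinations, this weighting gives the smallest variance.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total
    return combined, 1.0 / total

# Two services report the same user's mean with different noise levels.
combined, variance = inverse_variance_mean([1.0, 3.0], [1.0, 3.0])
```

The noisier report (variance 3.0) receives a third of the weight of the cleaner one, and the combined variance is lower than that of either report alone.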
Submitted 11 March, 2025;
originally announced March 2025.
-
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
Authors:
Qinghao Ye,
Xianhan Zeng,
Fu Li,
Chunyuan Li,
Haoqi Fan
Abstract:
Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.
Submitted 10 March, 2025;
originally announced March 2025.
-
Mol-CADiff: Causality-Aware Autoregressive Diffusion for Molecule Generation
Authors:
Md Atik Ahamed,
Qiang Ye,
Qiang Cheng
Abstract:
The design of novel molecules with desired properties is a key challenge in drug discovery and materials science. Traditional methods rely on trial-and-error, while recent deep learning approaches have accelerated molecular generation. However, existing models struggle with generating molecules based on specific textual descriptions. We introduce Mol-CADiff, a novel diffusion-based framework that uses causal attention mechanisms for text-conditional molecular generation. Our approach explicitly models the causal relationship between textual prompts and molecular structures, overcoming key limitations in existing methods. We enhance dependency modeling both within and across modalities, enabling precise control over the generation process. Our extensive experiments demonstrate that Mol-CADiff outperforms state-of-the-art methods in generating diverse, novel, and chemically valid molecules, with better alignment to specified properties, enabling more intuitive language-driven molecular design.
Submitted 7 March, 2025;
originally announced March 2025.
-
Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection
Authors:
Boyong He,
Yuxiang Ji,
Qianwen Ye,
Zhuoyue Tan,
Liaoni Wu
Abstract:
Domain generalization (DG) for object detection aims to enhance detectors' performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios. The code is available at https://github.com/heboyong/Generalized-Diffusion-Detector.
Submitted 4 June, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
-
Adaptive Keyframe Sampling for Long Video Understanding
Authors:
Xi Tang,
Jihao Qiu,
Lingxi Xie,
Yunjie Tian,
Jianbin Jiao,
Qixiang Ye
Abstract:
Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that Adaptive Keyframe Sampling improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Code is available at https://github.com/ncTimTang/AKS.
Submitted 28 February, 2025;
originally announced February 2025.
-
A Sample-Level Evaluation and Generative Framework for Model Inversion Attacks
Authors:
Haoyang Li,
Li Bai,
Qingqing Ye,
Haibo Hu,
Yaxin Xiao,
Huadi Zheng,
Jianliang Xu
Abstract:
Model Inversion (MI) attacks, which reconstruct the training dataset of neural networks, pose significant privacy concerns in machine learning. Recent MI attacks have managed to reconstruct realistic label-level private data, such as the general appearance of a target person from all training images labeled on him. Beyond label-level privacy, in this paper we show sample-level privacy, the private information of a single target sample, is also important but under-explored in the MI literature due to the limitations of existing evaluation metrics. To address this gap, this study introduces a novel metric tailored for training-sample analysis, namely, the Diversity and Distance Composite Score (DDCS), which evaluates the reconstruction fidelity of each training sample by encompassing various MI attack attributes. This, in turn, enhances the precision of sample-level privacy assessments.
Leveraging DDCS as a new evaluative lens, we observe that many training samples remain resilient against even the most advanced MI attack. As such, we further propose a transfer learning framework that augments the generative capabilities of MI attackers through the integration of entropy loss and natural gradient descent. Extensive experiments verify the effectiveness of our framework on improving state-of-the-art MI attacks over various metrics including DDCS, coverage and FID. Finally, we demonstrate that DDCS can also be useful for MI defense, by identifying samples susceptible to MI attacks in an unsupervised manner.
Submitted 26 February, 2025;
originally announced February 2025.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
Authors:
Yunjie Tian,
Qixiang Ye,
David Doermann
Abstract:
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
Submitted 17 February, 2025;
originally announced February 2025.
-
Ansatz-free Hamiltonian learning with Heisenberg-limited scaling
Authors:
Hong-Ye Hu,
Muzhou Ma,
Weiyuan Gong,
Qi Ye,
Yu Tong,
Steven T. Flammia,
Susanne F. Yelin
Abstract:
Learning the unknown interactions that govern a quantum system is crucial for quantum information processing, device benchmarking, and quantum sensing. The problem, known as Hamiltonian learning, is well understood under the assumption that interactions are local, but this assumption may not hold for arbitrary Hamiltonians. Previous methods all require high-order inverse polynomial dependency with precision, unable to surpass the standard quantum limit and reach the gold standard Heisenberg-limited scaling. Whether Heisenberg-limited Hamiltonian learning is possible without prior assumptions about the interaction structures, a challenge we term \emph{ansatz-free Hamiltonian learning}, remains an open question. In this work, we present a quantum algorithm to learn arbitrary sparse Hamiltonians without any structure constraints using only black-box queries of the system's real-time evolution and minimal digital controls to attain Heisenberg-limited scaling in estimation error. Our method is also resilient to state-preparation-and-measurement errors, enhancing its practical feasibility. We numerically demonstrate our ansatz-free protocol for learning physical Hamiltonians and validating analog quantum simulations, benchmarking our performance against the state-of-the-art Heisenberg-limited learning approach. Moreover, we establish a fundamental trade-off between total evolution time and quantum control on learning arbitrary interactions, revealing the intrinsic interplay between controllability and total evolution time complexity for any learning algorithm. These results pave the way for further exploration into Heisenberg-limited Hamiltonian learning in complex quantum systems under minimal assumptions, potentially enabling new benchmarking and verification protocols.
Submitted 30 June, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
Authors:
Guangya Yu,
Yanhao Li,
Zongying Jiang,
Yuxiong Jin,
Li Dai,
Yupian Lin,
Ruihui Hou,
Weiyan Zhang,
Yongqi Fan,
Qi Ye,
Jingping Liu,
Tong Ruan
Abstract:
Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task MQCIC and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to further improve performance in MQCIC. The dataset and code are available at https://anonymous.4open.science/r/C-MQCIC-1151.
Submitted 17 February, 2025;
originally announced February 2025.
-
Motion planning for highly-dynamic unconditioned reflexes based on chained Signed Distance Functions
Authors:
Ken Lin,
Qi Ye,
Tin Lun Lam,
Zhibin Li,
Jiming Chen,
Gaofeng Li
Abstract:
The unconditioned reflex (e.g., the protective reflex), an innate reaction of the organism that is usually mediated by the spinal cord rather than the brain, enables organisms to escape harm from their environment. In this paper, we propose an online, highly dynamic motion planning algorithm that endows manipulators with highly dynamic unconditioned reflexes in response to humans and/or environments. Our method is based on a chained version of Signed Distance Functions (SDFs), which can be pre-computed and stored. The proposed algorithm is divided into two stages. In the offline stage, we create 3 groups of local SDFs to store the geometric information of the manipulator and its working environment. In the online stage, the pre-computed local SDFs are chained together according to the configuration of the manipulator to provide global geometric information about the environment, while the point clouds of the dynamic objects serve as query points that look up these local SDFs to quickly generate an escape velocity. We then propose a modified geometric Jacobian matrix and use the Jacobian-pseudo-inverse method to generate real-time reflex behaviors that avoid the static and dynamic obstacles in the environment. The benefits of our method are validated in both static and dynamic scenarios. In the static scenario, our method finds path solutions with lower time consumption and shorter trajectory length than existing solutions. In the dynamic scenario, our method can reliably pursue the dynamic target point, avoid dynamic obstacles, and react to these obstacles within 1 ms, which surpasses the unconditioned reflex reaction time of humans.
Submitted 18 February, 2025; v1 submitted 15 February, 2025;
originally announced February 2025.
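The abstract above mentions turning an SDF-derived escape velocity into joint motion with a Jacobian-pseudo-inverse step. The following is only a minimal sketch of that general idea under strong simplifying assumptions, not the authors' method: it uses a 2-link planar arm, a single spherical obstacle (whose SDF gradient points straight away from its center), and a standard damped pseudo-inverse in place of the paper's modified geometric Jacobian and chained local SDFs. All function names, link lengths, and the damping value are hypothetical.

```python
import math


def planar_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of the end effector of a 2-link planar arm (assumed model)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ]


def escape_velocity(point, obstacle_center, speed=1.0):
    """Velocity along the gradient of a sphere SDF, i.e. directly away from the center.

    Assumes the point does not coincide with the obstacle center.
    """
    dx = point[0] - obstacle_center[0]
    dy = point[1] - obstacle_center[1]
    norm = math.hypot(dx, dy)
    return [speed * dx / norm, speed * dy / norm]


def damped_pinv_velocity(J, v, damping=1e-3):
    """Joint velocity dq = J^T (J J^T + damping^2 I)^-1 v for a 2x2 Jacobian."""
    # Entries of the symmetric matrix A = J J^T + damping^2 I.
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    # Solve A y = v with the closed-form 2x2 inverse.
    y0 = (d * v[0] - b * v[1]) / det
    y1 = (-b * v[0] + a * v[1]) / det
    # Map back to joint space: dq = J^T y.
    return [J[0][0] * y0 + J[1][0] * y1, J[0][1] * y0 + J[1][1] * y1]


if __name__ == "__main__":
    J = planar_jacobian(0.3, 0.8)
    v = escape_velocity([1.0, 1.0], [2.0, 1.0])  # obstacle to the right of the point
    dq = damped_pinv_velocity(J, v)
    print("escape velocity:", v)
    print("joint velocity:", dq)
```

The damping term keeps the solve well conditioned near singular arm configurations, at the cost of a small tracking error in the commanded escape velocity; away from singularities the result is close to the plain pseudo-inverse solution.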