-
Quantum-Enhanced Change Detection and Joint Communication-Detection
Authors:
Zihao Gong,
Saikat Guha
Abstract:
Quick detection of transmittance changes in an optical channel is crucial for secure communication. We demonstrate that pre-shared entanglement using two-mode squeezed vacuum states significantly reduces detection latency compared to classical and entanglement-augmented coherent-state probes. The change detection latency is inversely proportional to the quantum relative entropy (QRE), which goes to infinity in the absence of thermal noise, suggesting idealized instantaneous detection. However, in realistic scenarios, we show that QRE scales logarithmically with the inverse of the thermal noise mean photon number. We propose a receiver that achieves this scaling and quantify its performance gains over existing methods. Additionally, we explore the fundamental trade-off between communication capacity and change detection latency, highlighting how pre-shared entanglement enhances both.
Submitted 15 June, 2025; v1 submitted 24 April, 2025;
originally announced April 2025.
-
A novel hybrid neural network of fluid-structure interaction prediction for two cylinders in tandem arrangement
Authors:
Yanfang Lyu,
Yunyang Zhang,
Zhiqiang Gong,
Xiao Kang,
Wen Yao,
Yongmao Pei
Abstract:
Deep learning has shown promise in improving computing efficiency while ensuring modeling accuracy in fluid-structure interaction (FSI) analysis. However, its current capabilities are limited when it comes to constructing multi-object coupling systems with dynamic boundaries. To address this limitation, a novel FSI neural solver that integrates a multi-time-step fluid deep learning model with a structural dynamic solver is proposed to accurately and reliably predict the vortex-induced vibration (VIV) evolution for two cylinders in tandem. This end-to-end model predicts the instantaneous flow field state at the subsequent time step by coupling the temporal flow fields of historical multi-time sequences with the current structural responses, and then derives the structural state at the next time step. Furthermore, the novel fluid deep learning model consists of a wall shear model utilizing a multilayer perceptron network and a flow field model with a U-shaped architecture combining the Fourier neural operator and a modified convolutional long short-term memory model. Both models effectively capture coupling transfer forces and predict instantaneous flow fields, with the latter demonstrating superior accuracy compared to Convolutional Neural Network- or U-Net-based models with similar parameters. The prediction speed of the proposed models is over 1000 times faster than the numerical simulation. Significantly, the proposed FSI neural model demonstrates exceptional capability in constructing nonlinear complex multi-vibration systems and has substantial potential for advancing FSI modeling of flexible structures featuring pronounced nonlinear deformation boundaries.
Submitted 24 April, 2025; v1 submitted 21 April, 2025;
originally announced April 2025.
-
DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks
Authors:
Yupei Liu,
Yuqi Jia,
Jinyuan Jia,
Dawn Song,
Neil Zhenqiang Gong
Abstract:
LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.
Submitted 15 May, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
Zero-shot Autonomous Microscopy for Scalable and Intelligent Characterization of 2D Materials
Authors:
Jingyun Yang,
Ruoyan Avery Yin,
Chi Jiang,
Yuepeng Hu,
Xiaokai Zhu,
Xingjian Hu,
Sutharsika Kumar,
Xiao Wang,
Xiaohua Zhai,
Keran Rong,
Yunyue Zhu,
Tianyi Zhang,
Zongyou Yin,
Jing Kong,
Neil Zhenqiang Gong,
Zhichu Ren,
Haozhe Wang
Abstract:
Characterization of atomic-scale materials traditionally requires human experts with months to years of specialized training. Even for trained human operators, accurate and reliable characterization remains challenging when examining newly discovered materials such as two-dimensional (2D) structures. This bottleneck drives demand for fully autonomous experimentation systems capable of comprehending research objectives without requiring large training datasets. In this work, we present ATOMIC (Autonomous Technology for Optical Microscopy & Intelligent Characterization), an end-to-end framework that integrates foundation models to enable fully autonomous, zero-shot characterization of 2D materials. Our system integrates the vision foundation model (i.e., Segment Anything Model), large language models (i.e., ChatGPT), unsupervised clustering, and topological analysis to automate microscope control, sample scanning, image segmentation, and intelligent analysis through prompt engineering, eliminating the need for additional training. When analyzing typical MoS2 samples, our approach achieves 99.7% segmentation accuracy for single layer identification, which is equivalent to that of human experts. In addition, the integrated model is able to detect grain boundary slits that are challenging to identify with human eyes. Furthermore, the system retains robust accuracy despite variable conditions including defocus, color temperature fluctuations, and exposure variations. It is applicable to a broad spectrum of common 2D materials (including graphene, MoS2, WSe2, and SnSe), regardless of whether they were fabricated via chemical vapor deposition or mechanical exfoliation. This work represents the implementation of foundation models to achieve autonomous analysis, establishing a scalable and data-efficient characterization paradigm that fundamentally transforms the approach to nanoscale materials research.
Submitted 14 April, 2025;
originally announced April 2025.
-
Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation
Authors:
Xiaoxing Hu,
Ziyang Gong,
Yupei Wang,
Yuru Jia,
Gen Luo,
Xue Yang
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth-Adapter, the first PEFT method specifically designed to overcome RS artifacts. Earth-Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth-Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple yet effective approaches enable Earth-Adapter to overcome the disturbances caused by artifacts more efficiently than previous PEFT methods, significantly enhancing the FMs' performance in RS scenarios. Experiments on Domain Adaptation (DA) and Domain Generalization (DG) semantic segmentation benchmarks showcase Earth-Adapter's effectiveness. Compared with the baseline Rein, Earth-Adapter improves mIoU by 9.0% on DA and 3.1% on DG benchmarks. Our code will be released at https://github.com/VisionXLab/Earth-Adapter.
Submitted 16 April, 2025; v1 submitted 8 April, 2025;
originally announced April 2025.
-
Anomalous Maxwell-Garnett theory for photonic time crystals
Authors:
Zheng Gong,
Ruoxi Chen,
Hongsheng Chen,
Xiao Lin
Abstract:
Maxwell-Garnett theory, dating back to James Clerk Maxwell-Garnett's foundational work in 1904, provides a simple yet powerful framework to describe the inhomogeneous structure as an effective homogeneous medium, which significantly reduces the overall complexity of analysis, calculation, and design. As such, the Maxwell-Garnett theory enables many practical applications in diverse realms, ranging from photonics, acoustics, mechanics, and thermodynamics to material science. It has long been thought that the Maxwell-Garnett theory of light in impedance-mismatched periodic structures is valid only within the long-wavelength limit, necessitating either the temporal or spatial period of light to be much larger than that of structures. Here, we break this long-held belief by revealing an anomalous Maxwell-Garnett theory for impedance-mismatched photonic time crystals beyond this long-wavelength limit. The key to this anomaly lies in the Fabry-Pérot resonance. We discover that under the Fabry-Pérot resonance, the impedance-mismatched photonic time crystal could be essentially equivalent to a homogeneous temporal slab simultaneously at specific discrete wavelengths, despite the temporal period of this light being comparable to or even much smaller than that of photonic time crystals.
Submitted 8 April, 2025;
originally announced April 2025.
-
Towards Optimal Heterogeneous Client Sampling in Multi-Model Federated Learning
Authors:
Haoran Zhang,
Zejun Gong,
Zekai Li,
Marie Siew,
Carlee Joe-Wong,
Rachid El-Azouzi
Abstract:
Federated learning (FL) allows edge devices to collaboratively train models without sharing local data. As FL gains popularity, clients may need to train multiple unrelated FL models, but communication constraints limit their ability to train all models simultaneously. While clients could train FL models sequentially, opportunistically having FL clients concurrently train different models -- termed multi-model federated learning (MMFL) -- can reduce the overall training time. Prior work uses simple client-to-model assignments that do not optimize the contribution of each client to each model over the course of its training. Prior work on single-model FL shows that intelligent client selection can greatly accelerate convergence, but naïve extensions to MMFL can violate heterogeneous resource constraints at both the server and the clients. In this work, we develop a novel convergence analysis of MMFL with arbitrary client sampling methods, theoretically demonstrating the strengths and limitations of previous well-established gradient-based methods. Motivated by this analysis, we propose MMFL-LVR, a loss-based sampling method that minimizes training variance while explicitly respecting communication limits at the server and reducing computational costs at the clients. We extend this to MMFL-StaleVR, which incorporates stale updates for improved efficiency and stability, and MMFL-StaleVRE, a lightweight variant suitable for low-overhead deployment. Experiments show our methods improve average accuracy by up to 19.1% over random sampling, with only a 5.4% gap from the theoretical optimum (full client participation).
Submitted 21 April, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
Authenticated Sublinear Quantum Private Information Retrieval
Authors:
Fengxia Liu,
Zhiyong Zheng,
Kun Tian,
Yi Zhang,
Heng Guo,
Zhe Hu,
Oleksiy Zhedanov,
Zixian Gong
Abstract:
This paper introduces a novel lower bound on communication complexity using quantum relative entropy and mutual information, refining previous classical entropy-based results. By leveraging Uhlmann's lemma and quantum Pinsker inequalities, the authors establish tighter bounds for information-theoretic security, demonstrating that quantum protocols inherently outperform classical counterparts in balancing privacy and efficiency. The paper also explores symmetric Quantum Private Information Retrieval (QPIR) protocols that achieve sub-linear communication complexity while ensuring robustness against specious adversaries: a post-quantum-cryptography-based protocol that can be authenticated for the specious server; a ring-LWE-based protocol for post-quantum security in a single-server setting, ensuring robustness against quantum attacks; and a multi-server protocol optimized for hardware practicality, reducing implementation overhead while maintaining sub-linear efficiency. These protocols address critical gaps in secure database queries, offering exponential communication improvements over classical linear-complexity methods. The work also analyzes security trade-offs under quantum specious adversaries, providing theoretical guarantees for privacy and correctness.
Submitted 26 May, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Electron Penetration Acceleration in Turbulent Magnetic Loops
Authors:
Zheng Gong,
Sida Cao,
Caleb Redshaw,
Matthew R. Edwards
Abstract:
Using particle-in-cell simulations to study fast radio burst (FRB) propagation in a tenuous plasma, we identified a novel mechanism that occurs during the growth of turbulent magnetic loops: electron penetration acceleration. The loops have an electromagnetic left-hand chirality distinct from that of well-known quasistatic magnetic islands. The fast electrons penetrate through the loops and thus are accelerated to unexpected relativistic energies due to the symmetry breaking induced by the coupling between the loop field and the non-relativistic electromagnetic wave. The identified features of penetration acceleration and magnetic loops might provide a new perspective for understanding particle injection into relativistic collisionless shock precursors invoked in FRB-swept cosmic backgrounds. Additionally, we show that this FRB-relevant phenomenon could be tested in scaled laboratory experiments using a multi-terawatt laser impinging on gas targets.
Submitted 30 May, 2025; v1 submitted 3 April, 2025;
originally announced April 2025.
-
Measurement of LLM's Philosophies of Human Nature
Authors:
Minheng Ni,
Ennan Wu,
Zidong Gong,
Zhengyuan Yang,
Linjie Li,
Chung-Ching Lin,
Kevin Lin,
Lijuan Wang,
Wangmeng Zuo
Abstract:
The widespread application of artificial intelligence (AI) in various tasks, along with frequent reports of conflicts or violations involving AI, has sparked societal concerns about interactions with AI systems. Based on Wrightsman's Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades to effectively assess individuals' attitudes toward human nature, we design a standardized psychological scale specifically targeting large language models (LLMs), named the Machine-based Philosophies of Human Nature Scale (M-PHNS). By evaluating LLMs' attitudes toward human nature across six dimensions, we reveal that current LLMs exhibit a systemic lack of trust in humans, and there is a significant negative correlation between a model's intelligence level and its trust in humans. Furthermore, we propose a mental loop learning framework, which enables an LLM to continuously optimize its value system during virtual interactions by constructing moral scenarios, thereby improving its attitude toward human nature. Experiments demonstrate that mental loop learning significantly enhances LLMs' trust in humans compared to persona or instruction prompts. This finding highlights the potential of human-based psychological assessments for LLMs, which can not only diagnose cognitive biases but also provide a potential solution for ethical learning in artificial intelligence. We release the M-PHNS evaluation code and data at https://github.com/kodenii/M-PHNS.
Submitted 3 April, 2025;
originally announced April 2025.
-
MO-CTranS: A unified multi-organ segmentation model learning from multiple heterogeneously labelled datasets
Authors:
Zhendi Gong,
Susan Francis,
Eleanor Cox,
Stamatios N. Sotiropoulos,
Dorothee P. Auer,
Guoping Qiu,
Andrew P. French,
Xin Chen
Abstract:
Multi-organ segmentation holds paramount significance in many clinical tasks. In practice, compared to large fully annotated datasets, multiple small datasets are often more accessible and organs are not labelled consistently. Normally, an individual model is trained for each of these datasets, which is not an effective way of using data for model learning. It remains challenging to train a single model that can robustly learn from several partially labelled datasets due to label conflict and data imbalance problems. We propose MO-CTranS: a single model that can overcome such problems. MO-CTranS contains a CNN-based encoder and a Transformer-based decoder, which are connected in a multi-resolution manner. Task-specific tokens are introduced in the decoder to help differentiate label discrepancies. Our method was evaluated and compared to several baseline models and state-of-the-art (SOTA) solutions on abdominal MRI datasets that were acquired in different views (i.e. axial and coronal) and annotated for different organs (i.e. liver, kidney, spleen). Our method achieved better performance than the compared methods (most improvements were statistically significant). GitHub link: https://github.com/naisops/MO-CTranS.
Submitted 28 March, 2025;
originally announced March 2025.
-
Instance-Level Data-Use Auditing of Visual ML Models
Authors:
Zonghao Huang,
Neil Zhenqiang Gong,
Michael K. Reiter
Abstract:
The growing trend of legal disputes over the unauthorized use of data in machine learning (ML) systems highlights the urgent need for reliable data-use auditing mechanisms to ensure accountability and transparency in ML. In this paper, we present the first proactive instance-level data-use auditing method designed to enable data owners to audit the use of their individual data instances in ML models, providing more fine-grained auditing results. Our approach integrates any black-box membership inference technique with a sequential hypothesis test, providing a quantifiable and tunable false-detection rate. We evaluate our method on three types of visual ML models: image classifiers, visual encoders, and Contrastive Image-Language Pretraining (CLIP) models. In addition, we apply our method to evaluate the performance of two state-of-the-art approximate unlearning methods. Our findings reveal that neither method successfully removes the influence of the unlearned data instances from image classifiers and CLIP models, even when sacrificing model utility by $10.33\%$.
Submitted 28 March, 2025;
originally announced March 2025.
-
HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM
Authors:
Ziren Gong,
Fabio Tosi,
Youmin Zhang,
Stefano Mattoccia,
Matteo Poggi
Abstract:
NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes that involve significant movement or are prone to being forgotten. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drifts and mitigate accumulative errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.
Submitted 27 March, 2025;
originally announced March 2025.
-
Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM
Authors:
Codefuse,
Ling Team,
Wenting Cai,
Yuchen Cao,
Chaoyu Chen,
Chen Chen,
Siba Chen,
Qing Cui,
Peng Di,
Junpeng Fang,
Zi Gong,
Ting Guo,
Zhengyu He,
Yang Huang,
Cong Li,
Jianguo Li,
Zheng Li,
Shijie Lian,
BingChang Liu,
Songshan Luo,
Shuo Mao,
Min Shen,
Jian Wu,
Jiaolong Yang
, et al. (8 additional authors not shown)
Abstract:
Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at https://huggingface.co/inclusionAI/Ling-Coder-lite.
Submitted 22 March, 2025;
originally announced March 2025.
-
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
Authors:
Zenghui Yuan,
Jiawen Shi,
Pan Zhou,
Neil Zhenqiang Gong,
Lichao Sun
Abstract:
Multi-modal large language models (MLLMs) extend large language models (LLMs) to process multi-modal information, enabling them to generate responses to image-text inputs. MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning. This deployment paradigm increases the vulnerability of MLLMs to backdoor attacks. However, existing backdoor attacks against MLLMs achieve limited effectiveness and stealthiness. In this work, we propose BadToken, the first token-level backdoor attack to MLLMs. BadToken introduces two novel backdoor behaviors: Token-substitution and Token-addition, which enable flexible and stealthy attacks by making token-level modifications to the original output for backdoored inputs. We formulate a general optimization problem that considers the two backdoor behaviors to maximize the attack effectiveness. We evaluate BadToken on two open-source MLLMs and various tasks. Our results show that our attack maintains the model's utility while achieving high attack success rates and stealthiness. We also show the real-world threats of BadToken in two scenarios, i.e., autonomous driving and medical diagnosis. Furthermore, we consider defenses including fine-tuning and input purification. Our results highlight the threat of our attack.
Submitted 20 March, 2025;
originally announced March 2025.
-
Waveguide QED with dissipative light-matter couplings
Authors:
Xing-Liang Dong,
Peng-Bo Li,
Zongping Gong,
Franco Nori
Abstract:
Dissipative light-matter coupling plays a vital role in non-Hermitian physics, but it remains largely unexplored in waveguide QED systems. In this work, we find that by employing pseudo-Hermitian symmetry rather than anti-PT symmetry, the concept of dissipative coupling could be generalized and applied to the field of waveguide QED. This leads to a series of intriguing results, such as spontaneous breaking of pseudo-Hermitian symmetry across the exceptional points (EPs), level attraction between the bound states, and critical transition across the EPs for the population of quantum emitters in the bound state. Thanks to the tunability of photonic bands in crystal waveguides, we also demonstrate that dissipative light-matter coupling leads to the emergence of nonstandard third-order exceptional points with chiral spatial profiles in a topological waveguide QED system. This work provides a promising paradigm for studying non-Hermitian quantum phenomena in waveguide QED systems.
Submitted 19 March, 2025;
originally announced March 2025.
-
Exploring the Necessity of Reasoning in LLM-based Agent Scenarios
Authors:
Xueyang Zhou,
Guiyao Tie,
Guowen Zhang,
Weidong Wang,
Zhigang Zuo,
Di Wu,
Duanfeng Chu,
Pan Zhou,
Neil Zhenqiang Gong,
Lichao Sun
Abstract:
The rise of Large Reasoning Models (LRMs) signifies a paradigm shift toward advanced computational reasoning. Yet this progress disrupts traditional agent frameworks, which have been anchored by execution-oriented Large Language Models (LLMs). To explore this transformation, we propose the LaRMA framework, encompassing nine tasks across Tool Usage, Plan Design, and Problem Solving, assessed with three top LLMs (e.g., Claude3.5-sonnet) and five leading LRMs (e.g., DeepSeek-R1). Our findings address four research questions: LRMs surpass LLMs in reasoning-intensive tasks like Plan Design, leveraging iterative reflection for superior outcomes; LLMs excel in execution-driven tasks such as Tool Usage, prioritizing efficiency; hybrid LLM-LRM configurations, pairing LLMs as actors with LRMs as reflectors, optimize agent performance by blending execution speed with reasoning depth; and LRMs' enhanced reasoning incurs higher computational costs, prolonged processing, and behavioral challenges, including overthinking and fact-ignoring tendencies. This study fosters deeper inquiry into LRMs' balance of deep thinking and overthinking, laying a critical foundation for future agent design advancements.
Submitted 27 May, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
-
Learning Robotic Policy with Imagined Transition: Mitigating the Trade-off between Robustness and Optimality
Authors:
Wei Xiao,
Shangke Lyu,
Zhefei Gong,
Renjie Wang,
Donglin Wang
Abstract:
Existing quadrupedal locomotion learning paradigms usually rely on extensive domain randomization to alleviate the sim2real gap and enhance robustness. It trains policies with a wide range of environment parameters and sensor noises to perform reliably under uncertainty. However, since optimal performance under ideal conditions often conflicts with the need to handle worst-case scenarios, there is a trade-off between optimality and robustness. This trade-off forces the learned policy to prioritize stability in diverse and challenging conditions over efficiency and accuracy in ideal ones, leading to overly conservative behaviors that sacrifice peak performance. In this paper, we propose a two-stage framework that mitigates this trade-off by integrating policy learning with imagined transitions. This framework enhances the conventional reinforcement learning (RL) approach by incorporating imagined transitions as demonstrative inputs. These imagined transitions are derived from an optimal policy and a dynamics model operating within an idealized setting. Our findings indicate that this approach significantly mitigates the domain randomization-induced negative impact of existing RL algorithms. It leads to accelerated training, reduced tracking errors within the distribution, and enhanced robustness outside the distribution.
Submitted 13 March, 2025;
originally announced March 2025.
-
Not All Edges are Equally Robust: Evaluating the Robustness of Ranking-Based Federated Learning
Authors:
Zirui Gong,
Yanjun Zhang,
Leo Yu Zhang,
Zhaoxi Zhang,
Yong Xiang,
Shirui Pan
Abstract:
Federated Ranking Learning (FRL) is a state-of-the-art FL framework that stands out for its communication efficiency and resilience to poisoning attacks. It diverges from the traditional FL framework in two ways: 1) it leverages discrete rankings instead of gradient updates, significantly reducing communication costs and limiting the potential space for malicious updates, and 2) it uses majority voting on the server side to establish the global ranking, ensuring that individual updates have minimal influence since each client contributes only a single vote. These features enhance the system's scalability and position FRL as a promising paradigm for FL training.
However, our analysis reveals that FRL is not inherently robust, as certain edges are particularly vulnerable to poisoning attacks. Through a theoretical investigation, we prove the existence of these vulnerable edges and establish a lower bound and an upper bound for identifying them in each layer. Based on this finding, we introduce a novel local model poisoning attack against FRL, namely the Vulnerable Edge Manipulation (VEM) attack. The VEM attack focuses on identifying and perturbing the most vulnerable edges in each layer and leveraging an optimization-based approach to maximize the attack's impact. Through extensive experiments on benchmark datasets, we demonstrate that our attack achieves an overall 53.23% attack impact and is 3.7x more impactful than existing methods. Our findings highlight significant vulnerabilities in ranking-based FL systems and underline the urgency for the development of new robust FL frameworks.
Submitted 22 April, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
Authors:
Yuru Jia,
Valerio Marsocci,
Ziyang Gong,
Xue Yang,
Maarten Vergauwen,
Andrea Nascetti
Abstract:
Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily focus on discriminative objectives, such as contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models--which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation--remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. Code will be released.
Submitted 10 March, 2025;
originally announced March 2025.
-
Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation
Authors:
Yinuo Liu,
Zenghui Yuan,
Guiyao Tie,
Jiawen Shi,
Pan Zhou,
Lichao Sun,
Neil Zhenqiang Gong
Abstract:
Multimodal retrieval-augmented generation (RAG) enhances the visual reasoning capability of vision-language models (VLMs) by dynamically accessing information from external knowledge bases. In this work, we introduce Poisoned-MRAG, the first knowledge poisoning attack on multimodal RAG systems. Poisoned-MRAG injects a few carefully crafted image-text pairs into the multimodal knowledge database, manipulating VLMs to generate the attacker-desired response to a target query. Specifically, we formalize the attack as an optimization problem and propose two cross-modal attack strategies, dirty-label and clean-label, tailored to the attacker's knowledge and goals. Our extensive experiments across multiple knowledge databases and VLMs show that Poisoned-MRAG outperforms existing methods, achieving up to 98% attack success rate with just five malicious image-text pairs injected into the InfoSeek database (481,782 pairs). Additionally, we evaluate four defense strategies, including paraphrasing, duplicate removal, structure-driven mitigation, and purification, demonstrating their limited effectiveness and trade-offs against Poisoned-MRAG. Our results highlight the effectiveness and scalability of Poisoned-MRAG, underscoring its potential as a significant threat to multimodal RAG systems.
Submitted 14 March, 2025; v1 submitted 8 March, 2025;
originally announced March 2025.
-
Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning
Authors:
Guiyao Tie,
Zeli Zhao,
Dingjie Song,
Fuyang Wei,
Rong Zhou,
Yurou Dai,
Wen Yin,
Zhejian Yang,
Jiangyue Yan,
Yao Su,
Zhenhan Dai,
Yifeng Xie,
Yihan Cao,
Lichao Sun,
Pan Zhou,
Lifang He,
Hechang Chen,
Yu Zhang,
Qingsong Wen,
Tianming Liu,
Neil Zhenqiang Gong,
Jiliang Tang,
Caiming Xiong,
Heng Ji,
Philip S. Yu
, et al. (1 additional author not shown)
Abstract:
The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT's alignment strategies to DeepSeek-R1's innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.
Submitted 20 May, 2025; v1 submitted 8 March, 2025;
originally announced March 2025.
-
AI-driven Prediction of Insulin Resistance in Normal Populations: Comparing Models and Criteria
Authors:
Weihao Gao,
Zhuo Deng,
Zheng Gong,
Ziyi Jiang,
Lan Ma
Abstract:
Insulin resistance (IR) is a key precursor to diabetes and a significant risk factor for cardiovascular disease. Traditional IR assessment methods require multiple blood tests. We developed a simple AI model using only fasting blood glucose to predict IR in non-diabetic populations. Data from the NHANES (1999-2020) and CHARLS (2015) studies were used for model training and validation. Input features included age, gender, height, weight, blood pressure, waist circumference, and fasting blood glucose. The CatBoost algorithm achieved AUC values of 0.8596 (HOMA-IR) and 0.7777 (TyG index) in NHANES, with an external AUC of 0.7442 for TyG. For METS-IR prediction, the model achieved AUC values of 0.9731 (internal) and 0.9591 (external), with RMSE values of 3.2643 (internal) and 3.057 (external). SHAP analysis highlighted waist circumference as a key predictor of IR. This AI model offers a minimally invasive and effective tool for IR prediction, supporting early diabetes and cardiovascular disease prevention.
Submitted 6 March, 2025;
originally announced March 2025.
-
Uncovering inequalities in new knowledge learning by large language models across different languages
Authors:
Chenglong Wang,
Haoyu Tang,
Xiyuan Yang,
Yueqi Xie,
Jina Suh,
Sunayana Sitaram,
Junming Huang,
Yu Xie,
Zhaoya Gong,
Xing Xie,
Fangzhao Wu
Abstract:
As large language models (LLMs) gradually become integral tools for problem solving in daily life worldwide, understanding linguistic inequality is becoming increasingly important. Existing research has primarily focused on static analyses that assess the disparities in the existing knowledge and capabilities of LLMs across languages. However, LLMs are continuously evolving, acquiring new knowledge to generate up-to-date, domain-specific responses. Investigating linguistic inequalities within this dynamic process is, therefore, also essential. In this paper, we explore inequalities in new knowledge learning by LLMs across different languages and four key dimensions: effectiveness, transferability, prioritization, and robustness. Through extensive experiments under two settings (in-context learning and fine-tuning) using both proprietary and open-source models, we demonstrate that low-resource languages consistently face disadvantages across all four dimensions. By shedding light on these disparities, we aim to raise awareness of linguistic inequalities in LLMs' new knowledge learning, fostering the development of more inclusive and equitable future LLMs.
Submitted 5 March, 2025;
originally announced March 2025.
-
Cosmology with second and third-order shear statistics for the Dark Energy Survey: Methods and simulated analysis
Authors:
R. C. H. Gomes,
S. Sugiyama,
B. Jain,
M. Jarvis,
D. Anbajagane,
M. Gatti,
D. Gebauer,
Z. Gong,
A. Halder,
G. A. Marques,
S. Pandey,
J. L. Marshall,
S. Allam,
O. Alves,
F. Andrade-Oliveira,
D. Bacon,
J. Blazek,
S. Bocquet,
D. Brooks,
A. Carnero Rosell,
J. Carretero,
L. N. da Costa,
P. Doel,
C. Doux,
S. Everett
, et al. (34 additional authors not shown)
Abstract:
We present a new pipeline designed for the robust inference of cosmological parameters using both second- and third-order shear statistics. We build a theoretical model for rapid evaluation of three-point correlations using our fastnc code and integrate it into the CosmoSIS framework. We measure the two-point functions $ξ_{\pm}$ and the full configuration-dependent three-point shear correlation functions across all auto- and cross-redshift bins. We compress the three-point functions into the mass aperture statistic $\langle M_{\rm ap}^3\rangle$ for a set of 796 simulated shear maps designed to model the Dark Energy Survey (DES) Year 3 data. We estimate from it the full covariance matrix and model the effects of intrinsic alignments, shear calibration biases and photometric redshift uncertainties. We apply scale cuts to minimize the contamination from the baryonic signal as modeled through hydrodynamical simulations. We find a significant improvement of $83\%$ on the Figure of Merit in the $Ω_{\rm m}$-$S_8$ plane when we add the $\langle M_{\rm ap}^3\rangle$ data to the $ξ_{\pm}$ information. We present our findings for all relevant cosmological and systematic uncertainty parameters and discuss the complementarity of third-order and second-order statistics.
Submitted 5 March, 2025;
originally announced March 2025.
-
MindSimulator: Exploring Brain Concept Localization via Synthetic FMRI
Authors:
Guangyin Bao,
Qi Zhang,
Zixuan Gong,
Zhuojia Wu,
Duoqian Miao
Abstract:
Concept-selective regions within the human cerebral cortex exhibit significant activation in response to specific visual stimuli associated with particular concepts. Precisely localizing these regions stands as a crucial long-term goal in neuroscience to grasp essential brain functions and mechanisms. Conventional experiment-driven approaches hinge on manually constructed visual stimulus collections and corresponding brain activity recordings, constraining the support and coverage of concept localization. Additionally, these stimuli often consist of concept objects in unnatural contexts and are potentially biased by subjective preferences, thus prompting concerns about the validity and generalizability of the identified regions. To address these limitations, we propose a data-driven exploration approach. By synthesizing extensive brain activity recordings, we statistically localize various concept-selective regions. Our proposed MindSimulator leverages advanced generative technologies to learn the probability distribution of brain activity conditioned on concept-oriented visual stimuli. This enables the creation of simulated brain recordings that reflect real neural response patterns. Using the synthetic recordings, we successfully localize several well-studied concept-selective regions and validate them against empirical findings, achieving promising prediction accuracy. The feasibility opens avenues for exploring novel concept-selective regions and provides prior hypotheses for future neuroscience research.
Submitted 4 March, 2025;
originally announced March 2025.
-
Jailbreaking Safeguarded Text-to-Image Models via Large Language Models
Authors:
Zhengyuan Jiang,
Yuepeng Hu,
Yuchen Yang,
Yinzhi Cao,
Neil Zhenqiang Gong
Abstract:
Text-to-Image models may generate harmful content, such as pornographic images, particularly when unsafe prompts are submitted. To address this issue, safety filters are often added on top of text-to-image models, or the models themselves are aligned to reduce harmful outputs. However, these defenses remain vulnerable when an attacker strategically designs adversarial prompts to bypass these safety guardrails. In this work, we propose PromptTune, a method to jailbreak text-to-image models with safety guardrails using a fine-tuned large language model. Unlike other query-based jailbreak attacks that require repeated queries to the target model, our attack generates adversarial prompts efficiently after fine-tuning our AttackLLM. We evaluate our method on three datasets of unsafe prompts and against five safety guardrails. Our results demonstrate that our approach effectively bypasses safety guardrails, outperforms existing no-box attacks, and also facilitates other query-based attacks.
Submitted 3 March, 2025;
originally announced March 2025.
-
Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers
Authors:
Zixuan Gong,
Jiaye Teng,
Yong Liu
Abstract:
Transformers may exhibit two-stage training dynamics during the real-world training process. For instance, when training GPT-2 on the Counterfact dataset, the answers progress from syntactically incorrect to syntactically correct to semantically correct. However, existing theoretical analyses hardly account for this two-stage phenomenon. In this paper, we theoretically demonstrate how such two-stage training dynamics occur in transformers. Specifically, we analyze the dynamics of transformers using feature learning techniques under in-context learning regimes, based on a disentangled two-type feature structure. Such disentanglement of feature structure is general in practice, e.g., natural languages contain syntax and semantics, and proteins contain primary and secondary structures. To the best of our knowledge, this is the first rigorous result regarding a two-stage optimization process in transformers. Additionally, a corollary indicates that such a two-stage process is closely related to the spectral properties of the attention weights, which accords well with empirical findings.
Submitted 27 February, 2025;
originally announced February 2025.
-
SafeText: Safe Text-to-image Models via Aligning the Text Encoder
Authors:
Yuepeng Hu,
Zhengyuan Jiang,
Neil Zhenqiang Gong
Abstract:
Text-to-image models can generate harmful images when presented with unsafe prompts, posing significant safety and societal risks. Alignment methods aim to modify these models to ensure they generate only non-harmful images, even when exposed to unsafe prompts. A typical text-to-image model comprises two main components: 1) a text encoder and 2) a diffusion module. Existing alignment methods mainly focus on modifying the diffusion module to prevent harmful image generation. However, this often significantly impacts the model's behavior for safe prompts, causing substantial quality degradation of generated images. In this work, we propose SafeText, a novel alignment method that fine-tunes the text encoder rather than the diffusion module. By adjusting the text encoder, SafeText significantly alters the embedding vectors for unsafe prompts, while minimally affecting those for safe prompts. As a result, the diffusion module generates non-harmful images for unsafe prompts while preserving the quality of images for safe prompts. We evaluate SafeText on multiple datasets of safe and unsafe prompts, including those generated through jailbreak attacks. Our results show that SafeText effectively prevents harmful image generation with minor impact on the images for safe prompts, and SafeText outperforms six existing alignment methods. We will publish our code and data after paper acceptance.
Submitted 27 February, 2025;
originally announced February 2025.
-
Data-Driven Model Identification of Unbalanced Induction Motor Dynamics and Forces using SINDYc
Authors:
Emma Vancayseele,
Philip Desenfans,
Zifeng Gong,
Dries Vanoost,
Herbert De Gersem,
Davy Pissoort
Abstract:
This paper identifies the stator currents, torque and unbalanced magnetic pull (UMP) of an unbalanced induction motor by the System Identification of Nonlinear Dynamics with Control (SINDYc) method from time-series data of measurable quantities. The SINDYc model has been trained on data coming from a nonlinear magnetic equivalent circuit model for three rotor eccentricity configurations. When evaluating the SINDYc model for static eccentricity, torques and UMPs with excellent accuracies, i.e., 8.8 mNm and 4.87 N of mean absolute error, respectively, are found. When compared with a reference torque equation, this amounts to a 65% error reduction. For dynamic eccentricity, the estimation is more difficult. The SINDYc model is fast enough to be embedded in a control procedure.
Submitted 27 February, 2025;
originally announced February 2025.
-
Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization
Authors:
Zixuan Gong,
Xiaolin Hu,
Huayi Tang,
Yong Liu
Abstract:
Large language models (LLMs) have demonstrated remarkable in-context learning (ICL) abilities. However, existing theoretical analysis of ICL primarily exhibits two limitations: (a) Limited i.i.d. Setting. Most studies focus on supervised function learning tasks where prompts are constructed with i.i.d. input-label pairs. This i.i.d. assumption diverges significantly from real language learning scenarios where prompt tokens are interdependent. (b) Lack of Emergence Explanation. Most literature answers what ICL does from an implicit optimization perspective but falls short in elucidating how ICL emerges and the impact of the pre-training phase on ICL. In our paper, to extend (a), we adopt a more practical paradigm, auto-regressive next-token prediction (AR-NTP), which closely aligns with the actual training of language models. Specifically, within AR-NTP, we emphasize prompt token-dependency, which involves predicting each subsequent token based on the preceding sequence. To address (b), we formalize a systematic pre-training and ICL framework, highlighting the layer-wise structure of sequences and topics, alongside a two-level expectation. In conclusion, we present data-dependent, topic-dependent and optimization-dependent PAC-Bayesian generalization bounds for pre-trained LLMs, showing that ICL emerges from the generalization of sequences and topics. Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
Submitted 24 February, 2025;
originally announced February 2025.
-
A Survey of Model Extraction Attacks and Defenses in Distributed Computing Environments
Authors:
Kaixiang Zhao,
Lincan Li,
Kaize Ding,
Neil Zhenqiang Gong,
Yue Zhao,
Yushun Dong
Abstract:
Model Extraction Attacks (MEAs) threaten modern machine learning systems by enabling adversaries to steal models, exposing intellectual property and training data. With the increasing deployment of machine learning models in distributed computing environments, including cloud, edge, and federated learning settings, each paradigm introduces distinct vulnerabilities and challenges. Without a unified perspective on MEAs across these distributed environments, organizations risk fragmented defenses, inadequate risk assessments, and substantial economic and privacy losses. This survey is motivated by the urgent need to understand how the unique characteristics of cloud, edge, and federated deployments shape attack vectors and defense requirements. We systematically examine the evolution of attack methodologies and defense mechanisms across these environments, demonstrating how environmental factors influence security strategies in critical sectors such as autonomous vehicles, healthcare, and financial services. By synthesizing recent advances in MEAs research and discussing the limitations of current evaluation practices, this survey provides essential insights for developing robust and adaptive defense strategies. Our comprehensive approach highlights the importance of integrating protective measures across the entire distributed computing landscape to ensure the secure deployment of machine learning models.
Submitted 21 February, 2025;
originally announced February 2025.
-
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
Authors:
Yue Huang,
Chujie Gao,
Siyuan Wu,
Haoran Wang,
Xiangqi Wang,
Yujun Zhou,
Yanbo Wang,
Jiayi Ye,
Jiawen Shi,
Qihui Zhang,
Yuan Li,
Han Bao,
Zhaoyi Liu,
Tianrui Guan,
Dongping Chen,
Ruoxi Chen,
Kehan Guo,
Andy Zou,
Bryan Hooi Kuen-Yew,
Caiming Xiong,
Elias Stengel-Eskin,
Hongyang Zhang,
Hongzhi Yin,
Huan Zhang,
Huaxiu Yao, et al. (41 additional authors not shown)
Abstract:
Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, as well as industry practices and standards. Based on this analysis, we propose a set of guiding principles for GenFMs, developed through extensive multidisciplinary collaboration that integrates technical, ethical, legal, and societal perspectives. Second, we introduce TrustGen, the first dynamic benchmarking platform designed to evaluate trustworthiness across multiple dimensions and model types, including text-to-image, large language, and vision-language models. TrustGen leverages modular components--metadata curation, test case generation, and contextual variation--to enable adaptive and iterative assessments, overcoming the limitations of static evaluation methods. Using TrustGen, we reveal significant progress in trustworthiness while identifying persistent challenges. Finally, we provide an in-depth discussion of the challenges and future directions for trustworthy GenFMs, which reveals the complex, evolving nature of trustworthiness, highlighting the nuanced trade-offs between utility and trustworthiness, and consideration for various downstream applications, identifying persistent challenges and providing a strategic roadmap for future research. This work establishes a holistic framework for advancing trustworthiness in GenAI, paving the way for safer and more responsible integration of GenFMs into critical applications. To facilitate advancement in the community, we release the toolkit for dynamic evaluation.
Submitted 11 May, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation
Authors:
Wei Zhao,
Pengxiang Ding,
Min Zhang,
Zhefei Gong,
Shuanghao Bai,
Han Zhao,
Donglin Wang
Abstract:
Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome the above challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which empowers VLAS with the ability of multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.
Submitted 21 February, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
S2C: Learning Noise-Resistant Differences for Unsupervised Change Detection in Multimodal Remote Sensing Images
Authors:
Lei Ding,
Xibing Zuo,
Danfeng Hong,
Haitao Guo,
Jun Lu,
Zhihui Gong,
Lorenzo Bruzzone
Abstract:
Unsupervised Change Detection (UCD) in multimodal Remote Sensing (RS) images remains a difficult challenge due to the inherent spatio-temporal complexity within data, and the heterogeneity arising from different imaging sensors. Inspired by recent advancements in Visual Foundation Models (VFMs) and Contrastive Learning (CL) methodologies, this research aims to develop CL methodologies to translate implicit knowledge in VFM into change representations, thus eliminating the need for explicit supervision. To this end, we introduce a Semantic-to-Change (S2C) learning framework for UCD in both homogeneous and multimodal RS images. Unlike existing CL methodologies that typically focus on learning multi-temporal similarities, we introduce a novel triplet learning strategy that explicitly models temporal differences, which are crucial to the CD task. Furthermore, random spatial and spectral perturbations are introduced during the training to enhance robustness to temporal noise. In addition, a grid sparsity regularization is defined to suppress insignificant changes, and an IoU-matching algorithm is developed to refine the CD results. Experiments on four benchmark CD datasets demonstrate that the proposed S2C learning framework achieves significant improvements in accuracy, surpassing the current state of the art by over 31%, 9%, 23%, and 15%, respectively. It also demonstrates robustness and sample efficiency, suitable for training and adaptation of various Visual Foundation Models (VFMs) or backbone neural networks. The relevant code will be available at: github.com/DingLei14/S2C.
Submitted 18 February, 2025;
originally announced February 2025.
-
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges
Authors:
Bolei Ma,
Yuting Li,
Wei Zhou,
Ziwei Gong,
Yang Janet Liu,
Katja Jasinskaja,
Annemarie Friedrich,
Julia Hirschberg,
Frauke Kreuter,
Barbara Plank
Abstract:
Understanding pragmatics, the use of language in context, is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatic phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.
Submitted 12 June, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors:
Ailin Huang,
Boyong Wu,
Bruce Wang,
Chao Yan,
Chen Hu,
Chengli Feng,
Fei Tian,
Feiyu Shen,
Jingbei Li,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Gong,
Zixin Zhang,
Hongyu Zhou,
Jianjian Sun,
Brian Li,
Chengting Feng,
Changyi Wan,
Hanpeng Hu, et al. (120 additional authors not shown)
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Akan Cinematic Emotions (ACE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues
Authors:
David Sasu,
Zehui Wu,
Ziwei Gong,
Run Chen,
Pengyuan Shi,
Lin Ai,
Julia Hirschberg,
Natalie Schluter
Abstract:
In this paper, we introduce the Akan Conversation Emotion (ACE) dataset, the first multimodal emotion dialogue dataset for an African language, addressing the significant lack of resources for low-resource languages in emotion recognition research. ACE, developed for the Akan language, contains 385 emotion-labeled dialogues and 6,162 utterances across audio, visual, and textual modalities, along with word-level prosodic prominence annotations. The presence of prosodic labels in this dataset also makes it the first prosodically annotated African language dataset. We demonstrate the quality and utility of ACE through experiments using state-of-the-art emotion recognition methods, establishing solid baselines for future research. We hope ACE inspires further work on inclusive, linguistically and culturally diverse NLP resources.
Submitted 2 June, 2025; v1 submitted 15 February, 2025;
originally announced February 2025.
-
Provably Robust Federated Reinforcement Learning
Authors:
Minghong Fang,
Xilong Wang,
Neil Zhenqiang Gong
Abstract:
Federated reinforcement learning (FRL) allows agents to jointly learn a global decision-making policy under the guidance of a central server. While FRL has advantages, its decentralized design makes it prone to poisoning attacks. To mitigate this, Byzantine-robust aggregation techniques tailored for FRL have been introduced. Yet, in our work, we reveal that these current Byzantine-robust techniques are not immune to our newly introduced Normalized attack. Distinct from previous attacks that targeted enlarging the distance of policy updates before and after an attack, our Normalized attack emphasizes maximizing the angle of deviation between these updates. To counter these threats, we develop an ensemble FRL approach that is provably secure against both known and our newly proposed attacks. Our ensemble method involves training multiple global policies, where each is learnt by a group of agents using any foundational aggregation rule. These well-trained global policies then individually predict the action for a specific test state. The ultimate action is chosen based on a majority vote for discrete action systems or the geometric median for continuous ones. Our experimental results across different settings show that the Normalized attack can greatly disrupt non-ensemble Byzantine-robust methods, and our ensemble approach offers substantial resistance against poisoning attacks.
Submitted 12 February, 2025;
originally announced February 2025.
-
Efficient Diffusion Models: A Survey
Authors:
Hui Shen,
Jingxuan Zhang,
Boning Xiong,
Rui Hu,
Shoufa Chen,
Zhongwei Wan,
Xin Wang,
Yu Zhang,
Zixuan Gong,
Guangyin Bao,
Chaofan Tao,
Yongfeng Huang,
Ye Yuan,
Mi Zhang
Abstract:
Diffusion models have emerged as powerful generative models capable of producing high-quality content such as images, videos, and audio, demonstrating their potential to revolutionize digital content creation. However, these capabilities come at the cost of significant computational resources and lengthy generation time, underscoring the critical need to develop efficient techniques for practical deployment. In this survey, we provide a systematic and comprehensive review of research on efficient diffusion models. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient diffusion model topics from algorithm-level, system-level, and framework perspectives, respectively. We have also created a GitHub repository where we organize the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/Efficient-Diffusion-Model-Survey. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of efficient diffusion model research and inspire them to contribute to this important and exciting field.
Submitted 6 June, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation
Authors:
Xingtong Yu,
Zechuan Gong,
Chang Zhou,
Yuan Fang,
Hui Zhang
Abstract:
Graphs are able to model interconnected entities in many online services, supporting a wide range of applications on the Web. This raises an important question: How can we train a graph foundational model on multiple source domains and adapt to an unseen target domain? A major obstacle is that graphs from different domains often exhibit divergent characteristics. Some studies leverage large language models to align multiple domains based on textual descriptions associated with the graphs, limiting their applicability to text-attributed graphs. For text-free graphs, a few recent works attempt to align different feature distributions across domains, while generally neglecting structural differences. In this work, we propose a novel Structure Alignment framework for text-free Multi-domain Graph Pre-Training and cross-domain adaptation (SAMGPT). It is designed to learn multi-domain knowledge from graphs originating in multiple source domains, which can then be adapted to address applications in an unseen target domain. Specifically, we introduce a set of structure tokens to harmonize structure-based aggregation across source domains during the pre-training phase. Next, for cross-domain adaptation, we design dual prompts, namely, holistic prompts and specific prompts, which adapt unified multi-domain structural knowledge and fine-grained, domain-specific information, respectively, to a target domain. Finally, we conduct comprehensive experiments on seven public datasets to evaluate and analyze the effectiveness of SAMGPT.
Submitted 12 April, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
Pure momentum-shift bulk photovoltaic effect in ferroelectric flat-band Mott insulators
Authors:
Zhuocheng Lu,
Zhihao Gong,
Jingshan Qi,
Hua Wang,
Kai Chang
Abstract:
The shift current photovoltaic effect is conventionally understood as the real-space displacement of a wave packet induced by photoexcitation. However, this interpretation becomes insufficient in flat-band systems, where quasiparticles are too massive to accelerate in real space under the optical electric field. Here, we developed a physically consistent method to decompose the shift current into real-space and momentum-space components. A surprising pure momentum-space shift current is found theoretically in flat-band Mott insulator Nb$_3$X$_8$ (X = Cl, Br, I) monolayers. This work underscores that significant shift current responses can emerge even in systems with minimal interband polarization differences, highlighting the potential for exploring novel bulk photovoltaic effects in flat-band Mott insulators.
Submitted 24 February, 2025; v1 submitted 6 February, 2025;
originally announced February 2025.
-
Dark Brain Energy: Toward an Integrative Model of Spontaneous Slow Oscillations
Authors:
ZhuQing Gong,
XiNian Zuo
Abstract:
Neural oscillations facilitate the functioning of the human brain in spatial and temporal dimensions at various frequencies. These oscillations feature a universal frequency architecture that is governed by brain anatomy, ensuring frequency specificity remains invariant across different measurement techniques. Initial magnetic resonance imaging (MRI) methodology constrained functional MRI (fMRI) investigations to a singular frequency range, thereby neglecting the frequency characteristics inherent in blood oxygen level-dependent oscillations. With advancements in MRI technology, it has become feasible to decode intricate brain activities via multi-band frequency analysis (MBFA). During the past decade, the utilization of MBFA in fMRI studies has surged, unveiling frequency-dependent characteristics of spontaneous slow oscillations (SSOs) believed to base dark energy in the brain. There remains a dearth of conclusive insights and hypotheses pertaining to the properties and functionalities of SSOs in distinct bands. We surveyed the SSO MBFA studies during the past 15 years to delineate the attributes of SSOs and enlighten their correlated functions. We further proposed a model to elucidate the hierarchical organization of multi-band SSOs by integrating their function, aimed at bridging theoretical gaps and guiding future MBFA research endeavors.
Submitted 6 February, 2025;
originally announced February 2025.
-
Clustering of the extreme: A theoretical description of weak lensing critical points power spectra in the mildly nonlinear regime
Authors:
Zhengyangguang Gong,
Alexandre Barthelemy,
Sandrine Codis
Abstract:
In cosmic web analysis, complementary to traditional cosmological probes, the extrema (e.g. peaks and voids) two-point correlation functions (2PCFs) are of particular interest for the study of both astrophysical phenomena and cosmological structure formation. However, most previous studies constructed those statistics via N-body simulations without a robust theoretical derivation from first principles. A strong motivation exists for analytically describing the 2PCFs of these local extrema, taking into account the nonlinear gravitational evolution in the late Universe. In this paper, we derive analytical formulae for the power spectra and 2PCFs of 2D critical points, including peaks (maxima), voids (minima) and saddle points, in mildly non-Gaussian weak gravitational lensing fields. We apply a perturbative bias expansion to model the clustering of 2D critical points. We successfully derive the power spectrum of weak lensing critical points up to the next-to-next-to-leading order (NNLO) in gravitational perturbation theory, where trispectrum configurations of the weak lensing field have to be included. We numerically evaluate those power spectra up to the next-to-leading order (NLO), which corresponds to the inclusion of bispectrum configurations, and transform them to the corresponding 2PCFs. An exact Monte Carlo (MC) integration is performed assuming a Gaussian distributed density field to validate our theoretical predictions. Overall, we find similar properties in 2D compared to the clustering of 3D critical points previously measured from N-body simulations. Contrary to standard lensing power spectra analysis, we find distinct BAO features in the lensing peak 2PCFs due to the gradient and curvature constraints, and we quantify that non-Gaussianity accounts for ~10% of the signal at quasi-linear scales, which could be important for current stage-IV surveys.
Submitted 6 February, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
Gradual Domain Adaptation for Graph Learning
Authors:
Pui Ieng Lei,
Ximing Chen,
Yijun Sheng,
Yanyan Liu,
Jingzhi Guo,
Zhiguo Gong
Abstract:
Existing literature lacks a graph domain adaptation technique for handling large distribution shifts, primarily due to the difficulty in simulating an evolving path from source to target graph. To make a breakthrough, we present a graph gradual domain adaptation (GGDA) framework with the construction of a compact domain sequence that minimizes information loss in adaptations. Our approach starts with an efficient generation of knowledge-preserving intermediate graphs over the Fused Gromov-Wasserstein (FGW) metric. With the bridging data pool, GGDA domains are then constructed via a novel vertex-based domain progression, which comprises "close" vertex selections and adaptive domain advancement to enhance inter-domain information transferability. Theoretically, our framework concretizes the intractable inter-domain distance $W_p(μ_t,μ_{t+1})$ via implementable upper and lower bounds, enabling flexible adjustments of this metric for optimizing domain formation. Extensive experiments under various transfer scenarios validate the superior performance of our GGDA framework.
Submitted 27 June, 2025; v1 submitted 29 January, 2025;
originally announced January 2025.
-
Quantum Geometric Origin of Strain-Tunable Giant Second-Harmonic Generation in Bi$_2$O$_2$X (X=S, Se, Te)
Authors:
Zhefeng Lou,
Zhihao Gong,
Ziye Zhu,
Wenbin Li,
Xiao Lin,
Hua Wang
Abstract:
Two-dimensional (2D) materials with giant nonlinear optical (NLO) responses are essential for the development of advanced on-chip NLO devices. Using first-principles calculations, we predict a remarkable strain-induced enhancement of second-harmonic generation (SHG) in the high-performance 2D semiconductors Bi$_2$O$_2$X (X = S, Se, Te). The SHG susceptibilities of Bi$_2$O$_2$X under strain are on the order of 1~nm/V, rivalling the highest values reported among 2D materials. This giant SHG response originates from gauge-invariant geometric quantities, including the quantum metric, shift vector, and triple phase product. The strain also induces a bandgap variation in Bi$_2$O$_2$X. Intriguingly, in Bi$_2$O$_2$Te, strain-induced bandgap tuning drives a transition from a semiconductor to a half-metal, and ultimately to a polar metal. Our findings present a unique platform that combines strain-tunable bandgap engineering with exceptional NLO properties, while also highlighting the crucial role of quantum geometry in enhancing SHG.
Submitted 28 January, 2025;
originally announced January 2025.
-
A Comprehensive Survey on Self-Interpretable Neural Networks
Authors:
Yang Ji,
Ying Sun,
Yuting Zhang,
Zhigaoyuan Wang,
Yuanxin Zhuang,
Zheng Gong,
Dazhong Shen,
Chuan Qin,
Hengshu Zhu,
Hui Xiong
Abstract:
Neural networks have achieved remarkable success across various fields. However, the lack of interpretability limits their practical use, particularly in critical decision-making scenarios. Post-hoc interpretability, which provides explanations for pre-trained models, is often at risk of robustness and fidelity. This has inspired a rising interest in self-interpretable neural networks, which inherently reveal the prediction rationale through the model structures. Although there exist surveys on post-hoc interpretability, a comprehensive and systematic survey of self-interpretable neural networks is still missing. To address this gap, we first collect and review existing works on self-interpretable neural networks and provide a structured summary of their methodologies from five key perspectives: attribution-based, function-based, concept-based, prototype-based, and rule-based self-interpretation. We also present concrete, visualized examples of model explanations and discuss their applicability across diverse scenarios, including image, text, graph data, and deep reinforcement learning. Additionally, we summarize existing evaluation metrics for self-interpretability and identify open challenges in this field, offering insights for future research. To support ongoing developments, we present a publicly accessible resource to track advancements in this domain: https://github.com/yangji721/Awesome-Self-Interpretable-Neural-Network.
Submitted 21 March, 2025; v1 submitted 26 January, 2025;
originally announced January 2025.
-
Optimal spectral transport of non-Hermitian systems
Authors:
Mingtao Xu,
Zongping Gong,
Wei Yi
Abstract:
The optimal transport problem seeks to minimize the total transportation cost between two distributions, thus providing a measure of distance between them. In this work, we study the optimal transport of the eigenspectrum of one-dimensional non-Hermitian models as the spectrum deforms on the complex plane under a varying imaginary gauge field. Notably, according to the non-Bloch band theory, the deforming spectrum continuously connects the eigenspectra of the original non-Hermitian model (with vanishing gauge field) under different boundary conditions. It follows that the optimal spectral transport should contain key information of the model. Characterizing the optimal spectral transport through the Wasserstein metric, we show that, indeed, important features of the non-Hermitian model, such as the (auxiliary) generalized Brillouin zone, the non-Bloch exceptional point, and topological phase transition, can be determined from the Wasserstein-metric calculation. We confirm our conclusions using concrete examples. Our work highlights the key role of spectral geometry in non-Hermitian physics, and offers a practical and convenient access to the properties of non-Hermitian models.
Submitted 25 January, 2025;
originally announced January 2025.
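As a rough illustration of the Wasserstein metric mentioned in the abstract above, the sketch below computes the p-Wasserstein distance between two equal-size sets of points in the complex plane, each point carrying equal weight. This is not the authors' method: the function name `wasserstein_distance`, the toy point sets, and the brute-force search over matchings are all assumptions made for the example, and the approach is only practical for a handful of points.

```python
from itertools import permutations


def wasserstein_distance(source, target, p=2):
    """Brute-force p-Wasserstein distance between two equal-size point sets.

    Each point carries equal weight 1/N, so an optimal transport plan can be
    taken to be a one-to-one matching. We try every matching, which is only
    feasible for small N (hypothetical helper for illustration).
    """
    if len(source) != len(target):
        raise ValueError("point sets must have the same size")
    n = len(source)
    if n == 0:
        return 0.0
    # Minimum total cost over all one-to-one assignments source[i] -> target[j].
    best_cost = min(
        sum(abs(source[i] - target[j]) ** p for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return (best_cost / n) ** (1.0 / p)


# Two toy "spectra" in the complex plane (illustrative values only).
spectrum_a = [0 + 0j, 1 + 0j]
spectrum_b = [0 + 1j, 1 + 1j]
distance = wasserstein_distance(spectrum_a, spectrum_b)
print(distance)
```

For larger point sets, an assignment solver would replace the exhaustive search; the brute-force version is kept here only because it makes the definition explicit.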
-
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
Authors:
Austin T. Wang,
ZeMing Gong,
Angel X. Chang
Abstract:
3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.
Submitted 7 July, 2025; v1 submitted 2 January, 2025;
originally announced January 2025.
-
KnowRA: Knowledge Retrieval Augmented Method for Document-level Relation Extraction with Comprehensive Reasoning Abilities
Authors:
Chengcheng Mai,
Yuxiang Wang,
Ziyu Gong,
Hanxiang Wang,
Yihua Huang
Abstract:
Document-level relation extraction (Doc-RE) aims to extract relations between entities across multiple sentences. Compared with sentence-level RE, Doc-RE therefore requires more comprehensive, human-like reasoning abilities, involving complex cross-sentence interactions between entities, contexts, and external general knowledge. However, most existing Doc-RE methods focus on optimizing a single reasoning ability and lack the ability to use external knowledge for comprehensive reasoning over long documents. To solve these problems, we propose KnowRA, a knowledge retrieval augmented method with comprehensive reasoning that autonomously determines whether to accept external knowledge to assist Doc-RE. First, we construct a document graph for semantic encoding and integrate a co-reference resolution model to augment the co-reference reasoning ability. Then, we expand the document graph into a document knowledge graph by retrieving from an external knowledge base for common-sense reasoning, and present a novel knowledge filtration method to filter out irrelevant knowledge. Finally, we propose the axis attention mechanism to build direct and indirect associations with intermediary entities for cross-sentence logical reasoning. Extensive experiments on two datasets verify the effectiveness of our method compared with state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/KnowRA.
Submitted 1 May, 2025; v1 submitted 31 December, 2024;
originally announced January 2025.