-
Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment
Authors:
Pedram Zaree,
Md Abdullah Al Mamun,
Quazi Mishkatul Alam,
Yue Dong,
Ihsen Alouani,
Nael Abu-Ghazaleh
Abstract:
Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential Jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective Jailbreak attacks…
▽ More
Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential Jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective Jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks, that are also transferrable. The attacks amplify the success rate of existing Jailbreak algorithms including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using less than a third of the generation time).
△ Less
Submitted 21 February, 2025;
originally announced February 2025.
-
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
Authors:
Erfan Shayegani,
Md Abdullah Al Mamun,
Yu Fu,
Pedram Zaree,
Yue Dong,
Nael Abu-Ghazaleh
Abstract:
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security.…
▽ More
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of `jailbreak' attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
That Doesn't Go There: Attacks on Shared State in Multi-User Augmented Reality Applications
Authors:
Carter Slocum,
Yicheng Zhang,
Erfan Shayegani,
Pedram Zaree,
Nael Abu-Ghazaleh,
Jiasi Chen
Abstract:
Augmented Reality (AR) is expected to become a pervasive component in enabling shared virtual experiences. In order to facilitate collaboration among multiple users, it is crucial for multi-user AR applications to establish a consensus on the "shared state" of the virtual world and its augmentations, through which they interact within augmented reality spaces. Current methods to create and access…
▽ More
Augmented Reality (AR) is expected to become a pervasive component in enabling shared virtual experiences. In order to facilitate collaboration among multiple users, it is crucial for multi-user AR applications to establish a consensus on the "shared state" of the virtual world and its augmentations, through which they interact within augmented reality spaces. Current methods to create and access shared state collect sensor data from devices (e.g., camera images), process them, and integrate them into the shared state. However, this process introduces new vulnerabilities and opportunities for attacks. Maliciously writing false data to "poison" the shared state is a major concern for the security of the downstream victims that depend on it. Another type of vulnerability arises when reading the shared state; by providing false inputs, an attacker can view hologram augmentations at locations they are not allowed to access. In this work, we demonstrate a series of novel attacks on multiple AR frameworks with shared states, focusing on three publicly-accessible frameworks. We show that these frameworks, while using different underlying implementations, scopes, and mechanisms to read from and write to the shared state, have shared vulnerability to a unified threat model. Our evaluation of these state-of-art AR applications demonstrates reliable attacks both on updating and accessing shared state across the different systems. To defend against such threats, we discuss a number of potential mitigation strategies that can help enhance the security of multi-user AR applications.
△ Less
Submitted 8 March, 2024; v1 submitted 17 August, 2023;
originally announced August 2023.
-
Co(ve)rtex: ML Models as storage channels and their (mis-)applications
Authors:
Md Abdullah Al Mamun,
Quazi Mishkatul Alam,
Erfan Shayegani,
Pedram Zaree,
Ihsen Alouani,
Nael Abu-Ghazaleh
Abstract:
Machine learning (ML) models are overparameterized to support generality and avoid overfitting. The state of these parameters is essentially a "don't-care" with respect to the primary model provided that this state does not interfere with the primary model. In both hardware and software systems, don't-care states and undefined behavior have been shown to be sources of significant vulnerabilities.…
▽ More
Machine learning (ML) models are overparameterized to support generality and avoid overfitting. The state of these parameters is essentially a "don't-care" with respect to the primary model provided that this state does not interfere with the primary model. In both hardware and software systems, don't-care states and undefined behavior have been shown to be sources of significant vulnerabilities. In this paper, we propose a new information theoretic perspective of the problem; we consider the ML model as a storage channel with a capacity that increases with overparameterization. Specifically, we consider a sender that embeds arbitrary information in the model at training time, which can be extracted by a receiver with a black-box access to the deployed model. We derive an upper bound on the capacity of the channel based on the number of available unused parameters. We then explore black-box write and read primitives that allow the attacker to:(i) store data in an optimized way within the model by augmenting the training data at the transmitter side, and (ii) to read it by querying the model after it is deployed. We also consider a new version of the problem which takes information storage covertness into account. Specifically, to obtain storage covertness, we introduce a new constraint such that the data augmentation used for the write primitives minimizes the distribution shift with the initial (baseline task) distribution. This constraint introduces a level of "interference" with the initial task, thereby limiting the channel's effective capacity. Therefore, we develop optimizations to improve the capacity in this case, including a novel ML-specific substitution based error correction protocol. We believe that the proposed modeling of the problem offers new tools to better understand and mitigate potential vulnerabilities of ML, especially in the context of increasingly large models.
△ Less
Submitted 11 May, 2024; v1 submitted 17 July, 2023;
originally announced July 2023.