-
IF-GUIDE: Influence Function-Guided Detoxification of LLMs
Authors:
Zachary Coalson,
Juhan Bae,
Nicholas Carlini,
Sanghyun Hong
Abstract:
We study how training data contributes to the emergence of toxic behaviors in large-language models. Most prior work on reducing model toxicity adopts $reactive$ approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a $proactive$ approach$-$IF-Guide$-$which leverages influence functions to identify harmful tokens within…
▽ More
We study how training data contributes to the emergence of toxic behaviors in large-language models. Most prior work on reducing model toxicity adopts $reactive$ approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a $proactive$ approach$-$IF-Guide$-$which leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-Guide does not rely on human-preference data, which is typically required by existing alignment methods. In evaluation, we demonstrate that IF-Guide substantially reduces both explicit and implicit toxicity$-$by up to 10$\times$ compared to uncensored models, and up to 3$\times$ compared to baseline alignment methods, e.g., DPO and RAD$-$across both pre-training and fine-tuning scenarios. IF-Guide is computationally efficient: a billion-parameter model is $not$ $necessary$ for computing influence scores; a million-parameter model$-$with 7.5$\times$ fewer parameters$-$can effectively serve as a proxy for identifying harmful data. Our code is publicly available at: https://github.com/ztcoalson/IF-Guide
△ Less
Submitted 9 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
LLMs unlock new paths to monetizing exploits
Authors:
Nicholas Carlini,
Milad Nasr,
Edoardo Debenedetti,
Barry Wang,
Christopher A. Choquette-Choo,
Daphne Ippolito,
Florian Tramèr,
Matthew Jagielski
Abstract:
We argue that Large language models (LLMs) will soon alter the economics of cyberattacks. Instead of attacking the most commonly used software and monetizing exploits by targeting the lowest common denominator among victims, LLMs enable adversaries to launch tailored attacks on a user-by-user basis. On the exploitation front, instead of human attackers manually searching for one difficult-to-ident…
▽ More
We argue that Large language models (LLMs) will soon alter the economics of cyberattacks. Instead of attacking the most commonly used software and monetizing exploits by targeting the lowest common denominator among victims, LLMs enable adversaries to launch tailored attacks on a user-by-user basis. On the exploitation front, instead of human attackers manually searching for one difficult-to-identify bug in a product with millions of users, LLMs can find thousands of easy-to-identify bugs in products with thousands of users. And on the monetization front, instead of generic ransomware that always performs the same attack (encrypt all your data and request payment to decrypt), an LLM-driven ransomware attack could tailor the ransom demand based on the particular content of each exploited device.
We show that these two attacks (and several others) are imminently practical using state-of-the-art LLMs. For example, we show that without any human intervention, an LLM finds highly sensitive personal information in the Enron email dataset (e.g., an executive having an affair with another employee) that could be used for blackmail. While some of our attacks are still too expensive to scale widely today, the incentives to implement these attacks will only increase as LLMs get cheaper. Thus, we argue that LLMs create a need for new defense-in-depth approaches.
△ Less
Submitted 16 May, 2025;
originally announced May 2025.
-
Defeating Prompt Injections by Design
Authors:
Edoardo Debenedetti,
Ilia Shumailov,
Tianqi Fan,
Jamie Hayes,
Nicholas Carlini,
Daniel Fabian,
Christoph Kern,
Chongyang Shi,
Andreas Terzis,
Florian Tramèr
Abstract:
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an untrusted environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMe…
▽ More
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an untrusted environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL uses a notion of a capability to prevent the exfiltration of private data over unauthorized data flows by enforcing security policies when tools are called. We demonstrate effectiveness of CaMeL by solving $77\%$ of tasks with provable security (compared to $84\%$ with an undefended system) in AgentDojo. We release CaMeL at https://github.com/google-research/camel-prompt-injection.
△ Less
Submitted 24 June, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI
Authors:
Shayne Longpre,
Kevin Klyman,
Ruth E. Appel,
Sayash Kapoor,
Rishi Bommasani,
Michelle Sahar,
Sean McGregor,
Avijit Ghosh,
Borhane Blili-Hamelin,
Nathan Butters,
Alondra Nelson,
Amit Elazari,
Andrew Sellars,
Casey John Ellis,
Dane Sherrets,
Dawn Song,
Harley Geiger,
Ilona Cohen,
Lauren McIlvenny,
Madhulika Srikumar,
Mark M. Jaycox,
Markus Anderljung,
Nadine Farid Johnson,
Nicholas Carlini,
Nicolas Miailhe
, et al. (9 additional authors not shown)
Abstract:
The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and…
▽ More
The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we identify key gaps in the evaluation and reporting of flaws in GPAI systems. We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers in order to ease the process of submitting, reproducing, and triaging flaws in GPAI systems. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs, borrowing from bug bounties, with legal safe harbors to protect researchers. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports across the many stakeholders who may be impacted. These interventions are increasingly urgent, as evidenced by the prevalence of jailbreaks and other flaws that can transfer across different providers' GPAI systems. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.
△ Less
Submitted 25 March, 2025; v1 submitted 21 March, 2025;
originally announced March 2025.
-
AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
Authors:
Nicholas Carlini,
Javier Rando,
Edoardo Debenedetti,
Milad Nasr,
Florian Tramèr
Abstract:
We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, bench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if a LLM could solv…
▽ More
We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, bench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if a LLM could solve the challenges presented in bench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap between difficulty in attacking "real" code, and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses only succeeds on 54% of CTF-like defenses. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Authors:
Javier Rando,
Jie Zhang,
Nicholas Carlini,
Florian Tramèr
Abstract:
In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose languag…
▽ More
In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose language models. In this position paper, we argue that the situation is now even worse: in the era of LLMs, the field of adversarial ML studies problems that are (1) less clearly defined, (2) harder to solve, and (3) even more challenging to evaluate. As a result, we caution that yet another decade of work on adversarial ML may fail to produce meaningful progress.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
Authors:
Yangsibo Huang,
Milad Nasr,
Anastasios Angelopoulos,
Nicholas Carlini,
Wei-Lin Chiang,
Christopher A. Choquette-Choo,
Daphne Ippolito,
Matthew Jagielski,
Katherine Lee,
Ken Ziyu Liu,
Ion Stoica,
Florian Tramer,
Chiyuan Zhang
Abstract:
It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was respo…
▽ More
It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login are being integrated to strengthen the security in Chatbot Arena.
△ Less
Submitted 13 January, 2025;
originally announced January 2025.
-
On Evaluating the Durability of Safeguards for Open-Weight LLMs
Authors:
Xiangyu Qi,
Boyi Wei,
Nicholas Carlini,
Yangsibo Huang,
Tinghao Xie,
Luxi He,
Matthew Jagielski,
Milad Nasr,
Prateek Mittal,
Peter Henderson
Abstract:
Stakeholders -- from model developers to policymakers -- seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable LLM safeguards…
▽ More
Stakeholders -- from model developers to policymakers -- seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable LLM safeguards for open-weight LLMs that can withstand adversarial modifications of the model's weights via fine-tuning. This holds the promise of raising adversaries' costs even under strong threat models where adversaries can directly fine-tune model weights. However, in this paper, we urge for more careful characterization of the limits of these approaches. Through several case studies, we demonstrate that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We draw lessons from the evaluation pitfalls that we identify and suggest future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models, which can provide more useful and candid assessments to stakeholders.
△ Less
Submitted 9 December, 2024;
originally announced December 2024.
-
SoK: Watermarking for AI-Generated Content
Authors:
Xuandong Zhao,
Sam Gunn,
Miranda Christ,
Jaiden Fairoze,
Andres Fabrega,
Nicholas Carlini,
Sanjam Garg,
Sanghyun Hong,
Milad Nasr,
Florian Tramer,
Somesh Jha,
Lei Li,
Yu-Xiang Wang,
Dawn Song
Abstract:
As the outputs of generative AI (GenAI) techniques improve in quality, it becomes increasingly challenging to distinguish them from human-created content. Watermarking schemes are a promising approach to address the problem of distinguishing between AI and human-generated content. These schemes embed hidden signals within AI-generated content to enable reliable detection. While watermarking is not…
▽ More
As the outputs of generative AI (GenAI) techniques improve in quality, it becomes increasingly challenging to distinguish them from human-created content. Watermarking schemes are a promising approach to address the problem of distinguishing between AI and human-generated content. These schemes embed hidden signals within AI-generated content to enable reliable detection. While watermarking is not a silver bullet for addressing all risks associated with GenAI, it can play a crucial role in enhancing AI safety and trustworthiness by combating misinformation and deception. This paper presents a comprehensive overview of watermarking techniques for GenAI, beginning with the need for watermarking from historical and regulatory perspectives. We formalize the definitions and desired properties of watermarking schemes and examine the key objectives and threat models for existing approaches. Practical evaluation strategies are also explored, providing insights into the development of robust watermarking techniques capable of resisting various attacks. Additionally, we review recent representative works, highlight open challenges, and discuss potential directions for this emerging field. By offering a thorough understanding of watermarking in GenAI, this work aims to guide researchers in advancing watermarking methods and applications, and support policymakers in addressing the broader implications of GenAI.
△ Less
Submitted 12 June, 2025; v1 submitted 27 November, 2024;
originally announced November 2024.
-
Evaluating the Robustness of the "Ensemble Everything Everywhere" Defense
Authors:
Jie Zhang,
Christian Schlarmann,
Kristina Nikolić,
Nicholas Carlini,
Francesco Croce,
Matthias Hein,
Florian Tramèr
Abstract:
Ensemble everything everywhere is a defense to adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model's intermediate representations at multiple noisy image resolutions, producing a single robust classification. This defense was shown to be effective against multiple state-of-the-art attacks. Perhaps even more convincingly, it was…
▽ More
Ensemble everything everywhere is a defense to adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model's intermediate representations at multiple noisy image resolutions, producing a single robust classification. This defense was shown to be effective against multiple state-of-the-art attacks. Perhaps even more convincingly, it was shown that the model's gradients are perceptually aligned: attacks against the model produce noise that perceptually resembles the targeted class.
In this short note, we show that this defense is not robust to adversarial attack. We first show that the defense's randomness and ensembling method cause severe gradient masking. We then use standard adaptive attack techniques to reduce the defense's robust accuracy from 48% to 14% on CIFAR-100 and from 62% to 11% on CIFAR-10, under the $\ell_\infty$-norm threat model with $\varepsilon=8/255$.
△ Less
Submitted 3 February, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
Authors:
Michael Aerni,
Javier Rando,
Edoardo Debenedetti,
Nicholas Carlini,
Daphne Ippolito,
Florian Tramèr
Abstract:
Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversari…
▽ More
Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversarial reproduction, where we quantify the overlap between model responses and pretraining data when responding to natural and benign prompts. For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet. In worst cases, we find generations where 100% of the content can be found exactly online. For the same tasks, we find that human-written text has far less overlap with Internet data. We further study whether prompting strategies can close this reproduction gap between models and humans. While appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses -- even for benign interactions.
△ Less
Submitted 15 November, 2024;
originally announced November 2024.
-
Stealing User Prompts from Mixture of Experts
Authors:
Itay Yona,
Ilia Shumailov,
Jamie Hayes,
Nicholas Carlini
Abstract:
Mixture-of-Experts (MoE) models improve the efficiency and scalability of dense language models by routing each token to a small number of experts in each layer. In this paper, we show how an adversary that can arrange for their queries to appear in the same batch of examples as a victim's queries can exploit Expert-Choice-Routing to fully disclose a victim's prompt. We successfully demonstrate th…
▽ More
Mixture-of-Experts (MoE) models improve the efficiency and scalability of dense language models by routing each token to a small number of experts in each layer. In this paper, we show how an adversary that can arrange for their queries to appear in the same batch of examples as a victim's queries can exploit Expert-Choice-Routing to fully disclose a victim's prompt. We successfully demonstrate the effectiveness of this attack on a two-layer Mixtral model, exploiting the tie-handling behavior of the torch.topk CUDA implementation. Our results show that we can extract the entire prompt using $O({VM}^2)$ queries (with vocabulary size $V$ and prompt length $M$) or 100 queries on average per token in the setting we consider. This is the first attack to exploit architectural flaws for the purpose of extracting user prompts, introducing a new class of LLM vulnerabilities.
△ Less
Submitted 30 October, 2024;
originally announced October 2024.
-
Remote Timing Attacks on Efficient Language Model Inference
Authors:
Nicholas Carlini,
Milad Nasr
Abstract:
Scaling up language models has significantly increased their capabilities. But larger models are slower models, and so there is now an extensive body of work (e.g., speculative sampling or parallel decoding) that improves the (average case) efficiency of language model generation. But these techniques introduce data-dependent timing characteristics. We show it is possible to exploit these timing d…
▽ More
Scaling up language models has significantly increased their capabilities. But larger models are slower models, and so there is now an extensive body of work (e.g., speculative sampling or parallel decoding) that improves the (average case) efficiency of language model generation. But these techniques introduce data-dependent timing characteristics. We show it is possible to exploit these timing differences to mount a timing attack. By monitoring the (encrypted) network traffic between a victim user and a remote language model, we can learn information about the content of messages by noting when responses are faster or slower. With complete black-box access, on open source systems we show how it is possible to learn the topic of a user's conversation (e.g., medical advice vs. coding assistance) with 90%+ precision, and on production systems like OpenAI's ChatGPT and Anthropic's Claude we can distinguish between specific messages or infer the user's language. We further show that an active adversary can leverage a boosting attack to recover PII placed in messages (e.g., phone numbers or credit card numbers) for open source systems. We conclude with potential defenses and directions for future work.
△ Less
Submitted 22 October, 2024;
originally announced October 2024.
-
Persistent Pre-Training Poisoning of LLMs
Authors:
Yiming Zhang,
Javier Rando,
Ivan Evtimov,
Jianfeng Chi,
Eric Michael Smith,
Nicholas Carlini,
Florian Tramèr,
Daphne Ippolito
Abstract:
Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be co…
▽ More
Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.
△ Less
Submitted 17 October, 2024;
originally announced October 2024.
-
Polynomial Time Cryptanalytic Extraction of Deep Neural Networks in the Hard-Label Setting
Authors:
Nicholas Carlini,
Jorge Chávez-Saab,
Anna Hambitzer,
Francisco Rodríguez-Henríquez,
Adi Shamir
Abstract:
Deep neural networks (DNNs) are valuable assets, yet their public accessibility raises security concerns about parameter extraction by malicious actors. Recent work by Carlini et al. (crypto'20) and Canales-Martínez et al. (eurocrypt'24) has drawn parallels between this issue and block cipher key extraction via chosen plaintext attacks. Leveraging differential cryptanalysis, they demonstrated that…
▽ More
Deep neural networks (DNNs) are valuable assets, yet their public accessibility raises security concerns about parameter extraction by malicious actors. Recent work by Carlini et al. (crypto'20) and Canales-Martínez et al. (eurocrypt'24) has drawn parallels between this issue and block cipher key extraction via chosen plaintext attacks. Leveraging differential cryptanalysis, they demonstrated that all the weights and biases of black-box ReLU-based DNNs could be inferred using a polynomial number of queries and computational time. However, their attacks relied on the availability of the exact numeric value of output logits, which allowed the calculation of their derivatives. To overcome this limitation, Chen et al. (asiacrypt'24) tackled the more realistic hard-label scenario, where only the final classification label (e.g., "dog" or "car") is accessible to the attacker. They proposed an extraction method requiring a polynomial number of queries but an exponential execution time. In addition, their approach was applicable only to a restricted set of architectures, could deal only with binary classifiers, and was demonstrated only on tiny neural networks with up to four neurons split among up to two hidden layers. This paper introduces new techniques that, for the first time, achieve cryptanalytic extraction of DNN parameters in the most challenging hard-label setting, using both a polynomial number of queries and polynomial time. We validate our approach by extracting nearly one million parameters from a DNN trained on the CIFAR-10 dataset, comprising 832 neurons in four hidden layers. Our results reveal the surprising fact that all the weights of a ReLU-based DNN can be efficiently determined by analyzing only the geometric shape of its decision boundaries.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Authors:
Robert Hönig,
Javier Rando,
Nicholas Carlini,
Florian Tramèr
Abstract:
Artists are increasingly concerned about advancements in image generation models that can closely replicate their unique artistic styles. In response, several protection tools against style mimicry have been developed that incorporate small adversarial perturbations into artworks published online. In this work, we evaluate the effectiveness of popular protections -- with millions of downloads -- a…
▽ More
Artists are increasingly concerned about advancements in image generation models that can closely replicate their unique artistic styles. In response, several protection tools against style mimicry have been developed that incorporate small adversarial perturbations into artworks published online. In this work, we evaluate the effectiveness of popular protections -- with millions of downloads -- and show they only provide a false sense of security. We find that low-effort and "off-the-shelf" techniques, such as image upscaling, are sufficient to create robust mimicry methods that significantly degrade existing protections. Through a user study, we demonstrate that all existing protections can be easily bypassed, leaving artists vulnerable to style mimicry. We caution that tools based on adversarial perturbations cannot reliably protect artists from the misuse of generative AI, and urge the development of alternative non-technological solutions.
△ Less
Submitted 11 February, 2025; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Cutting through buggy adversarial example defenses: fixing 1 line of code breaks Sabre
Authors:
Nicholas Carlini
Abstract:
Sabre is a defense to adversarial examples that was accepted at IEEE S&P 2024. We first reveal significant flaws in the evaluation that point to clear signs of gradient masking. We then show the cause of this gradient masking: a bug in the original evaluation code. By fixing a single line of code in the original repository, we reduce Sabre's robust accuracy to 0%. In response to this, the authors…
▽ More
Sabre is a defense to adversarial examples that was accepted at IEEE S&P 2024. We first reveal significant flaws in the evaluation that point to clear signs of gradient masking. We then show the cause of this gradient masking: a bug in the original evaluation code. By fixing a single line of code in the original repository, we reduce Sabre's robust accuracy to 0%. In response to this, the authors modify the defense and introduce a new defense component not described in the original paper. But this fix contains a second bug; modifying one more line of code reduces robust accuracy to below baseline levels. After we released the first version of our paper online, the authors introduced another change to the defense; by commenting out one line of code during attack we reduce the robust accuracy to 0% again.
△ Less
Submitted 1 July, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
-
Forcing Diffuse Distributions out of Language Models
Authors:
Yiming Zhang,
Avi Schwarzschild,
Nicholas Carlini,
Zico Kolter,
Daphne Ippolito
Abstract:
Despite being trained specifically to follow user instructions, today's instructiontuned language models perform poorly when instructed to produce random outputs. For example, when prompted to pick a number uniformly between one and ten Llama-2-13B-chat disproportionately favors the number five, and when tasked with picking a first name at random, Mistral-7B-Instruct chooses Avery 40 times more of…
▽ More
Despite being trained specifically to follow user instructions, today's instructiontuned language models perform poorly when instructed to produce random outputs. For example, when prompted to pick a number uniformly between one and ten Llama-2-13B-chat disproportionately favors the number five, and when tasked with picking a first name at random, Mistral-7B-Instruct chooses Avery 40 times more often than we would expect based on the U.S. population. When these language models are used for real-world tasks where diversity of outputs is crucial, such as language model assisted dataset construction, their inability to produce diffuse distributions over valid choices is a major hurdle. In this work, we propose a fine-tuning method that encourages language models to output distributions that are diffuse over valid outcomes. The methods we introduce generalize across a variety of tasks and distributions and make large language models practical for synthetic dataset generation with little human intervention.
△ Less
Submitted 7 August, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models
Authors:
Yuxin Wen,
Leo Marchyok,
Sanghyun Hong,
Jonas Geiping,
Tom Goldstein,
Nicholas Carlini
Abstract:
It is commonplace to produce application-specific models by fine-tuning large pre-trained models using a small bespoke dataset. The widespread availability of foundation model checkpoints on the web poses considerable risks, including the vulnerability to backdoor attacks. In this paper, we unveil a new vulnerability: the privacy backdoor attack. This black-box privacy attack aims to amplify the p…
▽ More
It is commonplace to produce application-specific models by fine-tuning large pre-trained models using a small bespoke dataset. The widespread availability of foundation model checkpoints on the web poses considerable risks, including the vulnerability to backdoor attacks. In this paper, we unveil a new vulnerability: the privacy backdoor attack. This black-box privacy attack aims to amplify the privacy leakage that arises when fine-tuning a model: when a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. We conduct extensive experiments on various datasets and models, including both vision-language models (CLIP) and large language models, demonstrating the broad applicability and effectiveness of such an attack. Additionally, we carry out multiple ablation studies with different fine-tuning methods and inference strategies to thoroughly analyze this new threat. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
△ Less
Submitted 1 April, 2024;
originally announced April 2024.
-
Certified Robustness to Clean-Label Poisoning Using Diffusion Denoising
Authors:
Sanghyun Hong,
Nicholas Carlini,
Alexey Kurakin
Abstract:
We present a certified defense to clean-label poisoning attacks under $\ell_2$-norm. These attacks work by injecting a small number of poisoning samples (e.g., 1%) that contain bounded adversarial perturbations into the training data to induce a targeted misclassification of a test-time input. Inspired by the adversarial robustness achieved by $randomized$ $smoothing$, we show how an off-the-shelf…
▽ More
We present a certified defense to clean-label poisoning attacks under $\ell_2$-norm. These attacks work by injecting a small number of poisoning samples (e.g., 1%) that contain bounded adversarial perturbations into the training data to induce a targeted misclassification of a test-time input. Inspired by the adversarial robustness achieved by $randomized$ $smoothing$, we show how an off-the-shelf diffusion denoising model can sanitize the tampered training data. We extensively test our defense against seven clean-label poisoning attacks in both $\ell_2$ and $\ell_{\infty}$-norms and reduce their attack success to 0-16% with only a negligible drop in the test accuracy. We compare our defense with existing countermeasures against clean-label poisoning, showing that the defense reduces the attack success the most and offers the best model utility. Our results highlight the need for future work on developing stronger clean-label attacks and using our certified yet practical defense as a strong baseline to evaluate these attacks.
△ Less
Submitted 2 June, 2025; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Stealing Part of a Production Language Model
Authors:
Nicholas Carlini,
Daniel Paleka,
Krishnamurthy Dj Dvijotham,
Thomas Steinke,
Jonathan Hayase,
A. Feder Cooper,
Katherine Lee,
Matthew Jagielski,
Milad Nasr,
Arthur Conmy,
Itay Yona,
Eric Wallace,
David Rolnick,
Florian Tramèr
Abstract:
We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under \…
▽ More
We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under \$20 USD, our attack extracts the entire projection matrix of OpenAI's Ada and Babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.
△ Less
Submitted 9 July, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
Query-Based Adversarial Prompt Generation
Authors:
Jonathan Hayase,
Ema Borevkovic,
Nicholas Carlini,
Florian Tramèr,
Milad Nasr
Abstract:
Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on…
▽ More
Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.
△ Less
Submitted 7 December, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
Initialization Matters for Adversarial Transfer Learning
Authors:
Andong Hua,
Jindong Gu,
Zhiyu Xue,
Nicholas Carlini,
Eric Wong,
Yao Qin
Abstract:
With the prevalence of the Pretraining-Finetuning paradigm in transfer learning, the robustness of downstream tasks has become a critical concern. In this work, we delve into adversarial robustness in transfer learning and reveal the critical role of initialization, including both the pretrained model and the linear head. First, we discover the necessity of an adversarially robust pretrained model…
▽ More
With the prevalence of the Pretraining-Finetuning paradigm in transfer learning, the robustness of downstream tasks has become a critical concern. In this work, we delve into adversarial robustness in transfer learning and reveal the critical role of initialization, including both the pretrained model and the linear head. First, we discover the necessity of an adversarially robust pretrained model. Specifically, we reveal that with a standard pretrained model, Parameter-Efficient Finetuning (PEFT) methods either fail to be adversarially robust or continue to exhibit significantly degraded adversarial robustness on downstream tasks, even with adversarial training during finetuning. Leveraging a robust pretrained model, surprisingly, we observe that a simple linear probing can outperform full finetuning and other PEFT methods with random initialization on certain datasets. We further identify that linear probing excels in preserving robustness from the robust pretraining. Based on this, we propose Robust Linear Initialization (RoLI) for adversarial finetuning, which initializes the linear head with the weights obtained by adversarial linear probing to maximally inherit the robustness from pretraining. Across five different image classification datasets, we demonstrate the effectiveness of RoLI and achieve new state-of-the-art results. Our code is available at \url{https://github.com/DongXzz/RoLI}.
△ Less
Submitted 30 March, 2024; v1 submitted 9 December, 2023;
originally announced December 2023.
-
Scalable Extraction of Training Data from (Production) Language Models
Authors:
Milad Nasr,
Nicholas Carlini,
Jonathan Hayase,
Matthew Jagielski,
A. Feder Cooper,
Daphne Ippolito,
Christopher A. Choquette-Choo,
Eric Wallace,
Florian Tramèr,
Katherine Lee
Abstract:
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from…
▽ More
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
Report of the 1st Workshop on Generative AI and Law
Authors:
A. Feder Cooper,
Katherine Lee,
James Grimmelmann,
Daphne Ippolito,
Christopher Callison-Burch,
Christopher A. Choquette-Choo,
Niloofar Mireshghallah,
Miles Brundage,
David Mimno,
Madiha Zahrah Choksi,
Jack M. Balkin,
Nicholas Carlini,
Christopher De Sa,
Jonathan Frankle,
Deep Ganguli,
Bryant Gipson,
Andres Guadamuz,
Swee Leng Harris,
Abigail Z. Jacobs,
Elizabeth Joh,
Gautam Kamath,
Mark Lemley,
Cass Matthews,
Christine McLeavey,
Corynne McSherry
, et al. (10 additional authors not shown)
Abstract:
This report presents the takeaways of the inaugural Workshop on Generative AI and Law (GenLaw), held in July 2023. A cross-disciplinary group of practitioners and scholars from computer science and law convened to discuss the technical, doctrinal, and policy challenges presented by law for Generative AI, and by Generative AI for law, with an emphasis on U.S. law in particular. We begin the report…
▽ More
This report presents the takeaways of the inaugural Workshop on Generative AI and Law (GenLaw), held in July 2023. A cross-disciplinary group of practitioners and scholars from computer science and law convened to discuss the technical, doctrinal, and policy challenges presented by law for Generative AI, and by Generative AI for law, with an emphasis on U.S. law in particular. We begin the report with a high-level statement about why Generative AI is both immensely significant and immensely challenging for law. To meet these challenges, we conclude that there is an essential need for 1) a shared knowledge base that provides a common conceptual language for experts across disciplines; 2) clarification of the distinctive technical capabilities of generative-AI systems, as compared and contrasted to other computer and AI systems; 3) a logical taxonomy of the legal issues these systems raise; and, 4) a concrete research agenda to promote collaboration and knowledge-sharing on emerging issues at the intersection of Generative AI and law. In this report, we synthesize the key takeaways from the GenLaw workshop that begin to address these needs. All of the listed authors contributed to the workshop upon which this report is based, but they and their organizations do not necessarily endorse all of the specific claims in this report.
△ Less
Submitted 2 December, 2023; v1 submitted 10 November, 2023;
originally announced November 2023.
-
Privacy Side Channels in Machine Learning Systems
Authors:
Edoardo Debenedetti,
Giorgio Severi,
Nicholas Carlini,
Christopher A. Choquette-Choo,
Matthew Jagielski,
Milad Nasr,
Eric Wallace,
Florian Tramèr
Abstract:
Most current approaches for protecting privacy in machine learning (ML) assume that models exist in a vacuum. Yet, in reality, these models are part of larger systems that include components for training data filtering, output monitoring, and more. In this work, we introduce privacy side channels: attacks that exploit these system-level components to extract private information at far higher rates…
▽ More
Most current approaches for protecting privacy in machine learning (ML) assume that models exist in a vacuum. Yet, in reality, these models are part of larger systems that include components for training data filtering, output monitoring, and more. In this work, we introduce privacy side channels: attacks that exploit these system-level components to extract private information at far higher rates than is otherwise possible for standalone models. We propose four categories of side channels that span the entire ML lifecycle (training data filtering, input preprocessing, output post-processing, and query filtering) and allow for enhanced membership inference, data extraction, and even novel threats such as extraction of users' test queries. For example, we show that deduplicating training data before applying differentially-private training creates a side-channel that completely invalidates any provable privacy guarantees. We further show that systems which block language models from regenerating training data can be exploited to exfiltrate private keys contained in the training set--even if the model did not memorize these keys. Taken together, our results demonstrate the need for a holistic, end-to-end privacy analysis of machine learning systems.
△ Less
Submitted 18 July, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System
Authors:
Daphne Ippolito,
Nicholas Carlini,
Katherine Lee,
Milad Nasr,
Yun William Yu
Abstract:
Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-$k$ or nucleus sampling). Our ability to discover which decoding strategy was used has implicati…
▽ More
Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-$k$ or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model's predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT).
△ Less
Submitted 9 September, 2023;
originally announced September 2023.
-
Identifying and Mitigating the Security Risks of Generative AI
Authors:
Clark Barrett,
Brad Boyd,
Elie Burzstein,
Nicholas Carlini,
Brad Chen,
Jihye Choi,
Amrita Roy Chowdhury,
Mihai Christodorescu,
Anupam Datta,
Soheil Feizi,
Kathleen Fisher,
Tatsunori Hashimoto,
Dan Hendrycks,
Somesh Jha,
Daniel Kang,
Florian Kerschbaum,
Eric Mitchell,
John Mitchell,
Zulfikar Ramzan,
Khawaja Shams,
Dawn Song,
Ankur Taly,
Diyi Yang
Abstract:
Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well…
▽ More
Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks.
This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.
△ Less
Submitted 28 December, 2023; v1 submitted 28 August, 2023;
originally announced August 2023.
-
Universal and Transferable Adversarial Attacks on Aligned Language Models
Authors:
Andy Zou,
Zifan Wang,
Nicholas Carlini,
Milad Nasr,
J. Zico Kolter,
Matt Fredrikson
Abstract:
Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practic…
▽ More
Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
△ Less
Submitted 20 December, 2023; v1 submitted 27 July, 2023;
originally announced July 2023.
-
A LLM Assisted Exploitation of AI-Guardian
Authors:
Nicholas Carlini
Abstract:
Large language models (LLMs) are now highly capable at a diverse range of tasks. This paper studies whether or not GPT-4, one such LLM, is capable of assisting researchers in the field of adversarial machine learning. As a case study, we evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023, a top computer security conference. We completely bre…
▽ More
Large language models (LLMs) are now highly capable at a diverse range of tasks. This paper studies whether or not GPT-4, one such LLM, is capable of assisting researchers in the field of adversarial machine learning. As a case study, we evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023, a top computer security conference. We completely break this defense: the proposed scheme does not increase robustness compared to an undefended baseline.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance. This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done. We conclude by discussing (1) the warning signs present in the evaluation that suggested to us AI-Guardian would be broken, and (2) our experience with designing attacks and performing novel research using the most recent advances in language modeling.
△ Less
Submitted 20 July, 2023;
originally announced July 2023.
-
Backdoor Attacks for In-Context Learning with Language Models
Authors:
Nikhil Kandpal,
Matthew Jagielski,
Florian Tramèr,
Nicholas Carlini
Abstract:
Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor…
▽ More
Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. We show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general purpose capabilities. We design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. Finally we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone.
△ Less
Submitted 27 July, 2023;
originally announced July 2023.
-
Effective Prompt Extraction from Language Models
Authors:
Yiming Zhang,
Nicholas Carlini,
Daphne Ippolito
Abstract:
The text generated by large language models is commonly controlled by prompting, where a prompt prepended to a user's query guides the model's output. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold on marketplaces. However, anecdotal reports have shown ad…
▽ More
The text generated by large language models is commonly controlled by prompting, where a prompt prepended to a user's query guides the model's output. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold on marketplaces. However, anecdotal reports have shown adversarial users employing prompt extraction attacks to recover these prompts. In this paper, we present a framework for systematically measuring the effectiveness of these attacks. In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability. Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination. Prompt extraction from real systems such as Claude 3 and ChatGPT further suggest that system prompts can be revealed by an adversary despite existing defenses in place.
△ Less
Submitted 7 August, 2024; v1 submitted 13 July, 2023;
originally announced July 2023.
-
Are aligned neural networks adversarially aligned?
Authors:
Nicholas Carlini,
Milad Nasr,
Christopher A. Choquette-Choo,
Matthew Jagielski,
Irena Gao,
Anas Awadalla,
Pang Wei Koh,
Daphne Ippolito,
Katherine Lee,
Florian Tramer,
Ludwig Schmidt
Abstract:
Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models rema…
▽ More
Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs.
However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
△ Less
Submitted 6 May, 2024; v1 submitted 26 June, 2023;
originally announced June 2023.
-
Evading Black-box Classifiers Without Breaking Eggs
Authors:
Edoardo Debenedetti,
Nicholas Carlini,
Florian Tramèr
Abstract:
Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware, harmful content, etc). Queries to such systems carry a fundamentally asy…
▽ More
Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware, harmful content, etc). Queries to such systems carry a fundamentally asymmetric cost: queries detected as "bad" come at a higher cost because they trigger additional security filters, e.g., usage throttling or account suspension. Yet, we find that existing decision-based attacks issue a large number of "bad" queries, which likely renders them ineffective against security-critical systems. We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries. We thus pose it as an open problem to build black-box attacks that are more effective under realistic cost metrics.
△ Less
Submitted 14 February, 2024; v1 submitted 5 June, 2023;
originally announced June 2023.
-
Students Parrot Their Teachers: Membership Inference on Model Distillation
Authors:
Matthew Jagielski,
Milad Nasr,
Christopher Choquette-Choo,
Katherine Lee,
Nicholas Carlini
Abstract:
Model distillation is frequently proposed as a technique to reduce the privacy leakage of machine learning. These empirical privacy defenses rely on the intuition that distilled ``student'' models protect the privacy of training data, as they only interact with this data indirectly through a ``teacher'' model. In this work, we design membership inference attacks to systematically study the privacy…
▽ More
Model distillation is frequently proposed as a technique to reduce the privacy leakage of machine learning. These empirical privacy defenses rely on the intuition that distilled ``student'' models protect the privacy of training data, as they only interact with this data indirectly through a ``teacher'' model. In this work, we design membership inference attacks to systematically study the privacy provided by knowledge distillation to both the teacher and student training sets. Our new attacks show that distillation alone provides only limited privacy across a number of domains. We explain the success of our attacks on distillation by showing that membership inference attacks on a private dataset can succeed even if the target model is *never* queried on any actual training points, but only on inputs whose predictions are highly influenced by training data. Finally, we show that our attacks are strongest when student and teacher sets are similar, or when the attacker can poison the teacher set.
△ Less
Submitted 6 March, 2023;
originally announced March 2023.
-
Randomness in ML Defenses Helps Persistent Attackers and Hinders Evaluators
Authors:
Keane Lucas,
Matthew Jagielski,
Florian Tramèr,
Lujo Bauer,
Nicholas Carlini
Abstract:
It is becoming increasingly imperative to design robust ML defenses. However, recent work has found that many defenses that initially resist state-of-the-art attacks can be broken by an adaptive adversary. In this work we take steps to simplify the design of defenses and argue that white-box defenses should eschew randomness when possible. We begin by illustrating a new issue with the deployment o…
▽ More
It is becoming increasingly imperative to design robust ML defenses. However, recent work has found that many defenses that initially resist state-of-the-art attacks can be broken by an adaptive adversary. In this work we take steps to simplify the design of defenses and argue that white-box defenses should eschew randomness when possible. We begin by illustrating a new issue with the deployment of randomized defenses that reduces their security compared to their deterministic counterparts. We then provide evidence that making defenses deterministic simplifies robustness evaluation, without reducing the effectiveness of a truly robust defense. Finally, we introduce a new defense evaluation framework that leverages a defense's deterministic nature to better evaluate its adversarial robustness.
△ Less
Submitted 26 February, 2023;
originally announced February 2023.
-
Poisoning Web-Scale Training Datasets is Practical
Authors:
Nicholas Carlini,
Matthew Jagielski,
Christopher A. Choquette-Choo,
Daniel Paleka,
Will Pearce,
Hyrum Anderson,
Andreas Terzis,
Kurt Thomas,
Florian Tramèr
Abstract:
Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet…
▽ More
Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content -- such as Wikipedia -- where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notify the maintainers of each affected dataset and recommended several low-overhead defenses.
△ Less
Submitted 6 May, 2024; v1 submitted 20 February, 2023;
originally announced February 2023.
-
Tight Auditing of Differentially Private Machine Learning
Authors:
Milad Nasr,
Jamie Hayes,
Thomas Steinke,
Borja Balle,
Florian Tramèr,
Matthew Jagielski,
Nicholas Carlini,
Andreas Terzis
Abstract:
Auditing mechanisms for differential privacy use probabilistic means to empirically estimate the privacy level of an algorithm. For private machine learning, existing auditing mechanisms are tight: the empirical privacy estimate (nearly) matches the algorithm's provable privacy guarantee. But these auditing techniques suffer from two limitations. First, they only give tight estimates under implaus…
▽ More
Auditing mechanisms for differential privacy use probabilistic means to empirically estimate the privacy level of an algorithm. For private machine learning, existing auditing mechanisms are tight: the empirical privacy estimate (nearly) matches the algorithm's provable privacy guarantee. But these auditing techniques suffer from two limitations. First, they only give tight estimates under implausible worst-case assumptions (e.g., a fully adversarial dataset). Second, they require thousands or millions of training runs to produce non-trivial statistical estimates of the privacy leakage.
This work addresses both issues. We design an improved auditing scheme that yields tight privacy estimates for natural (not adversarially crafted) datasets -- if the adversary can see all model updates during training. Prior auditing works rely on the same assumption, which is permitted under the standard differential privacy threat model. This threat model is also applicable, e.g., in federated learning settings. Moreover, our auditing scheme requires only two training runs (instead of thousands) to produce tight privacy estimates, by adapting recent advances in tight composition theorems for differential privacy. We demonstrate the utility of our improved auditing schemes by surfacing implementation bugs in private machine learning code that eluded prior auditing techniques.
△ Less
Submitted 15 February, 2023;
originally announced February 2023.
-
Effective Robustness against Natural Distribution Shifts for Models with Different Training Data
Authors:
Zhouxing Shi,
Nicholas Carlini,
Ananth Balashankar,
Ludwig Schmidt,
Cho-Jui Hsieh,
Alex Beutel,
Yao Qin
Abstract:
"Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageN…
▽ More
"Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data. To do this, we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of effective robustness when there are models with different training data. It may also explain the surprising effective robustness gains of zero-shot CLIP-like models exhibited in prior works that used ImageNet as the only ID test set, while the gains diminish under our new evaluation. Additional artifacts including interactive visualizations are provided at https://shizhouxing.github.io/effective-robustness.
△ Less
Submitted 28 October, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Extracting Training Data from Diffusion Models
Authors:
Nicholas Carlini,
Jamie Hayes,
Milad Nasr,
Matthew Jagielski,
Vikash Sehwag,
Florian Tramèr,
Borja Balle,
Daphne Ippolito,
Eric Wallace
Abstract:
Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the…
▽ More
Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the-art models, ranging from photographs of individual people to trademarked company logos. We also train hundreds of diffusion models in various settings to analyze how different modeling and data decisions affect privacy. Overall, our results show that diffusion models are much less private than prior generative models such as GANs, and that mitigating these vulnerabilities may require new advances in privacy-preserving training.
△ Less
Submitted 30 January, 2023;
originally announced January 2023.
-
Publishing Efficient On-device Models Increases Adversarial Vulnerability
Authors:
Sanghyun Hong,
Nicholas Carlini,
Alexey Kurakin
Abstract:
Recent increases in the computational demands of deep neural networks (DNNs) have sparked interest in efficient deep learning mechanisms, e.g., quantization or pruning. These mechanisms enable the construction of a small, efficient version of commercial-scale models with comparable accuracy, accelerating their deployment to resource-constrained devices.
In this paper, we study the security consi…
▽ More
Recent increases in the computational demands of deep neural networks (DNNs) have sparked interest in efficient deep learning mechanisms, e.g., quantization or pruning. These mechanisms enable the construction of a small, efficient version of commercial-scale models with comparable accuracy, accelerating their deployment to resource-constrained devices.
In this paper, we study the security considerations of publishing on-device variants of large-scale models. We first show that an adversary can exploit on-device models to make attacking the large models easier. In evaluations across 19 DNNs, by exploiting the published on-device models as a transfer prior, the adversarial vulnerability of the original commercial-scale models increases by up to 100x. We then show that the vulnerability increases as the similarity between a full-scale and its efficient model increase. Based on the insights, we propose a defense, $similarity$-$unpairing$, that fine-tunes on-device models with the objective of reducing the similarity. We evaluated our defense on all the 19 DNNs and found that it reduces the transferability up to 90% and the number of queries required by a factor of 10-100x. Our results suggest that further research is needed on the security (or even privacy) threats caused by publishing those efficient siblings.
△ Less
Submitted 28 December, 2022;
originally announced December 2022.
-
Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining
Authors:
Florian Tramèr,
Gautam Kamath,
Nicholas Carlini
Abstract:
The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets. We critically review this approach.
We primarily question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pret…
▽ More
The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets. We critically review this approach.
We primarily question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy.
Beyond the privacy considerations of using public data, we further question the utility of this paradigm. We scrutinize whether existing machine learning benchmarks are appropriate for measuring the ability of pretrained models to generalize to sensitive domains, which may be poorly represented in public Web data. Finally, we notice that pretraining has been especially impactful for the largest available models -- models sufficiently large to prohibit end users running them on their own devices. Thus, deploying such models today could be a net loss for privacy, as it would require (private) data to be outsourced to a more compute-powerful third party.
We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.
△ Less
Submitted 17 July, 2024; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy
Authors:
Daphne Ippolito,
Florian Tramèr,
Milad Nasr,
Chiyuan Zhang,
Matthew Jagielski,
Katherine Lee,
Christopher A. Choquette-Choo,
Nicholas Carlini
Abstract:
Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. We argu…
▽ More
Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. We argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization. Specifically, we design and implement an efficient defense that perfectly prevents all verbatim memorization. And yet, we demonstrate that this "perfect" filter does not prevent the leakage of training data. Indeed, it is easily circumvented by plausible and minimally modified "style-transfer" prompts -- and in some cases even the non-modified original prompts -- to extract memorized information. We conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models.
△ Less
Submitted 11 September, 2023; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Preprocessors Matter! Realistic Decision-Based Attacks on Machine Learning Systems
Authors:
Chawin Sitawarin,
Florian Tramèr,
Nicholas Carlini
Abstract:
Decision-based attacks construct adversarial examples against a machine learning (ML) model by making only hard-label queries. These attacks have mainly been applied directly to standalone neural networks. However, in practice, ML models are just one component of a larger learning system. We find that by adding a single preprocessor in front of a classifier, state-of-the-art query-based attacks ar…
▽ More
Decision-based attacks construct adversarial examples against a machine learning (ML) model by making only hard-label queries. These attacks have mainly been applied directly to standalone neural networks. However, in practice, ML models are just one component of a larger learning system. We find that by adding a single preprocessor in front of a classifier, state-of-the-art query-based attacks are up to 7$\times$ less effective at attacking a prediction pipeline than at attacking the model alone. We explain this discrepancy by the fact that most preprocessors introduce some notion of invariance to the input space. Hence, attacks that are unaware of this invariance inevitably waste a large number of queries to re-discover or overcome it. We, therefore, develop techniques to (i) reverse-engineer the preprocessor and then (ii) use this extracted information to attack the end-to-end system. Our preprocessors extraction method requires only a few hundred queries, and our preprocessor-aware attacks recover the same efficacy as when attacking the model alone. The code can be found at https://github.com/google-research/preprocessor-aware-black-box-attack.
△ Less
Submitted 20 July, 2023; v1 submitted 6 October, 2022;
originally announced October 2022.
-
No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy"
Authors:
Nicholas Carlini,
Vitaly Feldman,
Milad Nasr
Abstract:
New methods designed to preserve data privacy require careful scrutiny. Failure to preserve privacy is hard to detect, and yet can lead to catastrophic results when a system implementing a ``privacy-preserving'' method is attacked. A recent work selected for an Outstanding Paper Award at ICML 2022 (Dong et al., 2022) claims that dataset condensation (DC) significantly improves data privacy when tr…
▽ More
New methods designed to preserve data privacy require careful scrutiny. Failure to preserve privacy is hard to detect, and yet can lead to catastrophic results when a system implementing a ``privacy-preserving'' method is attacked. A recent work selected for an Outstanding Paper Award at ICML 2022 (Dong et al., 2022) claims that dataset condensation (DC) significantly improves data privacy when training machine learning models. This claim is supported by theoretical analysis of a specific dataset condensation technique and an empirical evaluation of resistance to some existing membership inference attacks.
In this note we examine the claims in the work of Dong et al. (2022) and describe major flaws in the empirical evaluation of the method and its theoretical analysis. These flaws imply that their work does not provide statistically significant evidence that DC improves the privacy of training ML models over a naive baseline. Moreover, previously published results show that DP-SGD, the standard approach to privacy preserving ML, simultaneously gives better accuracy and achieves a (provably) lower membership attack success rate.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
Part-Based Models Improve Adversarial Robustness
Authors:
Chawin Sitawarin,
Kornrapat Pongmala,
Yizheng Chen,
Nicholas Carlini,
David Wagner
Abstract:
We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification. We believe that the richer form of annotation helps guide neural networks to learn more robust features without requiring more samples or larger models. Our model combines a part segmentation model with a tiny classifi…
▽ More
We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification. We believe that the richer form of annotation helps guide neural networks to learn more robust features without requiring more samples or larger models. Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts and then classify the segmented object. Empirically, our part-based models achieve both higher accuracy and higher adversarial robustness than a ResNet-50 baseline on all three datasets. For instance, the clean accuracy of our part models is up to 15 percentage points higher than the baseline's, given the same level of robustness. Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations. The code is publicly available at https://github.com/chawins/adv-part-model.
△ Less
Submitted 8 March, 2023; v1 submitted 15 September, 2022;
originally announced September 2022.
-
Measuring Forgetting of Memorized Training Examples
Authors:
Matthew Jagielski,
Om Thakkar,
Florian Tramèr,
Daphne Ippolito,
Katherine Lee,
Nicholas Carlini,
Eric Wallace,
Shuang Song,
Abhradeep Thakurta,
Nicolas Papernot,
Chiyuan Zhang
Abstract:
Machine learning models exhibit two seemingly contradictory phenomena: training data memorization, and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure to what…
▽ More
Machine learning models exhibit two seemingly contradictory phenomena: training data memorization, and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure to what extent models "forget" the specifics of training examples, becoming less susceptible to privacy attacks on examples they have not seen recently. We show that, while non-convex models can memorize data forever in the worst-case, standard image, speech, and language models empirically do forget examples over time. We identify nondeterminism as a potential explanation, showing that deterministically trained models do not forget. Our results suggest that examples seen early when training with extremely large datasets - for instance those examples used to pre-train a model - may observe privacy benefits at the expense of examples seen later.
△ Less
Submitted 9 May, 2023; v1 submitted 30 June, 2022;
originally announced July 2022.
-
Increasing Confidence in Adversarial Robustness Evaluations
Authors:
Roland S. Zimmermann,
Wieland Brendel,
Florian Tramer,
Nicholas Carlini
Abstract:
Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of these defenses held up their claims because correctly evaluating robustness is extremely challenging: Weak attacks often fail to find adversarial examples even if they unknowingly exist, thereby making a vulnerable network look robust. In this pa…
▽ More
Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of these defenses held up their claims because correctly evaluating robustness is extremely challenging: Weak attacks often fail to find adversarial examples even if they unknowingly exist, thereby making a vulnerable network look robust. In this paper, we propose a test to identify weak attacks, and thus weak defense evaluations. Our test slightly modifies a neural network to guarantee the existence of an adversarial example for every sample. Consequentially, any correct attack must succeed in breaking this modified network. For eleven out of thirteen previously-published defenses, the original evaluation of the defense fails our test, while stronger attacks that break these defenses pass it. We hope that attack unit tests - such as ours - will be a major component in future robustness evaluations and increase confidence in an empirical field that is currently riddled with skepticism.
△ Less
Submitted 28 June, 2022;
originally announced June 2022.
-
(Certified!!) Adversarial Robustness for Free!
Authors:
Nicholas Carlini,
Florian Tramer,
Krishnamurthy Dj Dvijotham,
Leslie Rice,
Mingjie Sun,
J. Zico Kolter
Abstract:
In this paper we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. 2020 by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accura…
▽ More
In this paper we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. 2020 by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within an 2-norm of 0.5, an improvement of 14 percentage points over the prior certified SoTA using any approach, or an improvement of 30 percentage points over denoised smoothing. We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine tuning or retraining of model parameters.
△ Less
Submitted 6 March, 2023; v1 submitted 21 June, 2022;
originally announced June 2022.
-
The Privacy Onion Effect: Memorization is Relative
Authors:
Nicholas Carlini,
Matthew Jagielski,
Chiyuan Zhang,
Nicolas Papernot,
Andreas Terzis,
Florian Tramer
Abstract:
Machine learning models trained on private datasets have been shown to leak their private data. While recent work has found that the average data point is rarely leaked, the outlier samples are frequently subject to memorization and, consequently, privacy leakage. We demonstrate and analyse an Onion Effect of memorization: removing the "layer" of outlier points that are most vulnerable to a privac…
▽ More
Machine learning models trained on private datasets have been shown to leak their private data. While recent work has found that the average data point is rarely leaked, the outlier samples are frequently subject to memorization and, consequently, privacy leakage. We demonstrate and analyse an Onion Effect of memorization: removing the "layer" of outlier points that are most vulnerable to a privacy attack exposes a new layer of previously-safe points to the same attack. We perform several experiments to study this effect, and understand why it occurs. The existence of this effect has various consequences. For example, it suggests that proposals to defend against memorization without training with rigorous privacy guarantees are unlikely to be effective. Further, it suggests that privacy-enhancing technologies such as machine unlearning could actually harm the privacy of other users.
△ Less
Submitted 22 June, 2022; v1 submitted 21 June, 2022;
originally announced June 2022.