Showing 1–50 of 101 results for author: Carlini, N

  1. arXiv:2506.01790  [pdf, ps, other]

    cs.LG cs.CR

    IF-GUIDE: Influence Function-Guided Detoxification of LLMs

    Authors: Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong

    Abstract: We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-Guide, which leverages influence functions to identify harmful tokens within…

    Submitted 9 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: Pre-print

  2. arXiv:2505.11449  [pdf, other]

    cs.CR cs.AI

    LLMs unlock new paths to monetizing exploits

    Authors: Nicholas Carlini, Milad Nasr, Edoardo Debenedetti, Barry Wang, Christopher A. Choquette-Choo, Daphne Ippolito, Florian Tramèr, Matthew Jagielski

    Abstract: We argue that large language models (LLMs) will soon alter the economics of cyberattacks. Instead of attacking the most commonly used software and monetizing exploits by targeting the lowest common denominator among victims, LLMs enable adversaries to launch tailored attacks on a user-by-user basis. On the exploitation front, instead of human attackers manually searching for one difficult-to-ident…

    Submitted 16 May, 2025; originally announced May 2025.

  3. arXiv:2503.18813  [pdf, ps, other]

    cs.CR cs.AI

    Defeating Prompt Injections by Design

    Authors: Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, Florian Tramèr

    Abstract: Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an untrusted environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMe…

    Submitted 24 June, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: Updated version with newer models and link to the code

  4. arXiv:2503.16861  [pdf, other]

    cs.AI

    In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

    Authors: Shayne Longpre, Kevin Klyman, Ruth E. Appel, Sayash Kapoor, Rishi Bommasani, Michelle Sahar, Sean McGregor, Avijit Ghosh, Borhane Blili-Hamelin, Nathan Butters, Alondra Nelson, Amit Elazari, Andrew Sellars, Casey John Ellis, Dane Sherrets, Dawn Song, Harley Geiger, Ilona Cohen, Lauren McIlvenny, Madhulika Srikumar, Mark M. Jaycox, Markus Anderljung, Nadine Farid Johnson, Nicholas Carlini, Nicolas Miailhe, et al. (9 additional authors not shown)

    Abstract: The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and…

    Submitted 25 March, 2025; v1 submitted 21 March, 2025; originally announced March 2025.

  5. arXiv:2503.01811  [pdf, other]

    cs.CR cs.AI cs.LG

    AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses

    Authors: Nicholas Carlini, Javier Rando, Edoardo Debenedetti, Milad Nasr, Florian Tramèr

    Abstract: We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solv…

    Submitted 3 March, 2025; originally announced March 2025.

  6. arXiv:2502.02260  [pdf, ps, other]

    cs.LG cs.CR

    Adversarial ML Problems Are Getting Harder to Solve and to Evaluate

    Authors: Javier Rando, Jie Zhang, Nicholas Carlini, Florian Tramèr

    Abstract: In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose languag…

    Submitted 4 February, 2025; originally announced February 2025.

  7. arXiv:2501.07493  [pdf, other]

    cs.LG cs.CR

    Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

    Authors: Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang

    Abstract: It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was respo…

    Submitted 13 January, 2025; originally announced January 2025.

  8. arXiv:2412.07097  [pdf, other]

    cs.CR cs.AI

    On Evaluating the Durability of Safeguards for Open-Weight LLMs

    Authors: Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson

    Abstract: Stakeholders -- from model developers to policymakers -- seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable LLM safeguards…

    Submitted 9 December, 2024; originally announced December 2024.

  9. arXiv:2411.18479  [pdf, ps, other]

    cs.CR cs.AI cs.LG

    SoK: Watermarking for AI-Generated Content

    Authors: Xuandong Zhao, Sam Gunn, Miranda Christ, Jaiden Fairoze, Andres Fabrega, Nicholas Carlini, Sanjam Garg, Sanghyun Hong, Milad Nasr, Florian Tramer, Somesh Jha, Lei Li, Yu-Xiang Wang, Dawn Song

    Abstract: As the outputs of generative AI (GenAI) techniques improve in quality, it becomes increasingly challenging to distinguish them from human-created content. Watermarking schemes are a promising approach to address the problem of distinguishing between AI and human-generated content. These schemes embed hidden signals within AI-generated content to enable reliable detection. While watermarking is not…

    Submitted 12 June, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

    Comments: IEEE S&P 2025

  10. arXiv:2411.14834  [pdf, other]

    cs.LG cs.CR

    Evaluating the Robustness of the "Ensemble Everything Everywhere" Defense

    Authors: Jie Zhang, Christian Schlarmann, Kristina Nikolić, Nicholas Carlini, Francesco Croce, Matthias Hein, Florian Tramèr

    Abstract: Ensemble everything everywhere is a defense to adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model's intermediate representations at multiple noisy image resolutions, producing a single robust classification. This defense was shown to be effective against multiple state-of-the-art attacks. Perhaps even more convincingly, it was…

    Submitted 3 February, 2025; v1 submitted 22 November, 2024; originally announced November 2024.

  11. arXiv:2411.10242  [pdf, other]

    cs.CL cs.LG

    Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

    Authors: Michael Aerni, Javier Rando, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr

    Abstract: Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversari…

    Submitted 15 November, 2024; originally announced November 2024.

  12. arXiv:2410.22884  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG

    Stealing User Prompts from Mixture of Experts

    Authors: Itay Yona, Ilia Shumailov, Jamie Hayes, Nicholas Carlini

    Abstract: Mixture-of-Experts (MoE) models improve the efficiency and scalability of dense language models by routing each token to a small number of experts in each layer. In this paper, we show how an adversary that can arrange for their queries to appear in the same batch of examples as a victim's queries can exploit Expert-Choice-Routing to fully disclose a victim's prompt. We successfully demonstrate th…

    Submitted 30 October, 2024; originally announced October 2024.

  13. arXiv:2410.17175  [pdf, other]

    cs.CR cs.LG

    Remote Timing Attacks on Efficient Language Model Inference

    Authors: Nicholas Carlini, Milad Nasr

    Abstract: Scaling up language models has significantly increased their capabilities. But larger models are slower models, and so there is now an extensive body of work (e.g., speculative sampling or parallel decoding) that improves the (average case) efficiency of language model generation. But these techniques introduce data-dependent timing characteristics. We show it is possible to exploit these timing d…

    Submitted 22 October, 2024; originally announced October 2024.

  14. arXiv:2410.13722  [pdf, other]

    cs.CR cs.AI

    Persistent Pre-Training Poisoning of LLMs

    Authors: Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito

    Abstract: Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be co…

    Submitted 17 October, 2024; originally announced October 2024.

  15. arXiv:2410.05750  [pdf, other]

    cs.CR cs.AI

    Polynomial Time Cryptanalytic Extraction of Deep Neural Networks in the Hard-Label Setting

    Authors: Nicholas Carlini, Jorge Chávez-Saab, Anna Hambitzer, Francisco Rodríguez-Henríquez, Adi Shamir

    Abstract: Deep neural networks (DNNs) are valuable assets, yet their public accessibility raises security concerns about parameter extraction by malicious actors. Recent work by Carlini et al. (crypto'20) and Canales-Martínez et al. (eurocrypt'24) has drawn parallels between this issue and block cipher key extraction via chosen plaintext attacks. Leveraging differential cryptanalysis, they demonstrated that…

    Submitted 8 October, 2024; originally announced October 2024.

  16. arXiv:2406.12027  [pdf, other]

    cs.CR

    Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

    Authors: Robert Hönig, Javier Rando, Nicholas Carlini, Florian Tramèr

    Abstract: Artists are increasingly concerned about advancements in image generation models that can closely replicate their unique artistic styles. In response, several protection tools against style mimicry have been developed that incorporate small adversarial perturbations into artworks published online. In this work, we evaluate the effectiveness of popular protections -- with millions of downloads -- a…

    Submitted 11 February, 2025; v1 submitted 17 June, 2024; originally announced June 2024.

  17. arXiv:2405.03672  [pdf, other]

    cs.CR cs.LG

    Cutting through buggy adversarial example defenses: fixing 1 line of code breaks Sabre

    Authors: Nicholas Carlini

    Abstract: Sabre is a defense to adversarial examples that was accepted at IEEE S&P 2024. We first reveal significant flaws in the evaluation that point to clear signs of gradient masking. We then show the cause of this gradient masking: a bug in the original evaluation code. By fixing a single line of code in the original repository, we reduce Sabre's robust accuracy to 0%. In response to this, the authors…

    Submitted 1 July, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

  18. arXiv:2404.10859  [pdf, other]

    cs.CL cs.LG

    Forcing Diffuse Distributions out of Language Models

    Authors: Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, Zico Kolter, Daphne Ippolito

    Abstract: Despite being trained specifically to follow user instructions, today's instruction-tuned language models perform poorly when instructed to produce random outputs. For example, when prompted to pick a number uniformly between one and ten, Llama-2-13B-chat disproportionately favors the number five, and when tasked with picking a first name at random, Mistral-7B-Instruct chooses Avery 40 times more of…

    Submitted 7 August, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

  19. arXiv:2404.01231  [pdf, other]

    cs.CR cs.LG

    Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models

    Authors: Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, Nicholas Carlini

    Abstract: It is commonplace to produce application-specific models by fine-tuning large pre-trained models using a small bespoke dataset. The widespread availability of foundation model checkpoints on the web poses considerable risks, including the vulnerability to backdoor attacks. In this paper, we unveil a new vulnerability: the privacy backdoor attack. This black-box privacy attack aims to amplify the p…

    Submitted 1 April, 2024; originally announced April 2024.

  20. arXiv:2403.11981  [pdf, ps, other]

    cs.CR cs.CV cs.LG

    Certified Robustness to Clean-Label Poisoning Using Diffusion Denoising

    Authors: Sanghyun Hong, Nicholas Carlini, Alexey Kurakin

    Abstract: We present a certified defense to clean-label poisoning attacks under $\ell_2$-norm. These attacks work by injecting a small number of poisoning samples (e.g., 1%) that contain bounded adversarial perturbations into the training data to induce a targeted misclassification of a test-time input. Inspired by the adversarial robustness achieved by randomized smoothing, we show how an off-the-shelf…

    Submitted 2 June, 2025; v1 submitted 18 March, 2024; originally announced March 2024.

  21. arXiv:2403.06634  [pdf, other]

    cs.CR

    Stealing Part of a Production Language Model

    Authors: Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr

    Abstract: We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under …

    Submitted 9 July, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

  22. arXiv:2402.12329  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Query-Based Adversarial Prompt Generation

    Authors: Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr

    Abstract: Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on…

    Submitted 7 December, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

  23. arXiv:2312.05716  [pdf, other]

    cs.CV

    Initialization Matters for Adversarial Transfer Learning

    Authors: Andong Hua, Jindong Gu, Zhiyu Xue, Nicholas Carlini, Eric Wong, Yao Qin

    Abstract: With the prevalence of the Pretraining-Finetuning paradigm in transfer learning, the robustness of downstream tasks has become a critical concern. In this work, we delve into adversarial robustness in transfer learning and reveal the critical role of initialization, including both the pretrained model and the linear head. First, we discover the necessity of an adversarially robust pretrained model…

    Submitted 30 March, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  24. arXiv:2311.17035  [pdf, other]

    cs.LG cs.CL cs.CR

    Scalable Extraction of Training Data from (Production) Language Models

    Authors: Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee

    Abstract: This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from…

    Submitted 28 November, 2023; originally announced November 2023.

  25. arXiv:2311.06477  [pdf, other]

    cs.CY

    Report of the 1st Workshop on Generative AI and Law

    Authors: A. Feder Cooper, Katherine Lee, James Grimmelmann, Daphne Ippolito, Christopher Callison-Burch, Christopher A. Choquette-Choo, Niloofar Mireshghallah, Miles Brundage, David Mimno, Madiha Zahrah Choksi, Jack M. Balkin, Nicholas Carlini, Christopher De Sa, Jonathan Frankle, Deep Ganguli, Bryant Gipson, Andres Guadamuz, Swee Leng Harris, Abigail Z. Jacobs, Elizabeth Joh, Gautam Kamath, Mark Lemley, Cass Matthews, Christine McLeavey, Corynne McSherry, et al. (10 additional authors not shown)

    Abstract: This report presents the takeaways of the inaugural Workshop on Generative AI and Law (GenLaw), held in July 2023. A cross-disciplinary group of practitioners and scholars from computer science and law convened to discuss the technical, doctrinal, and policy challenges presented by law for Generative AI, and by Generative AI for law, with an emphasis on U.S. law in particular. We begin the report…

    Submitted 2 December, 2023; v1 submitted 10 November, 2023; originally announced November 2023.

  26. arXiv:2309.05610  [pdf, other]

    cs.CR cs.LG

    Privacy Side Channels in Machine Learning Systems

    Authors: Edoardo Debenedetti, Giorgio Severi, Nicholas Carlini, Christopher A. Choquette-Choo, Matthew Jagielski, Milad Nasr, Eric Wallace, Florian Tramèr

    Abstract: Most current approaches for protecting privacy in machine learning (ML) assume that models exist in a vacuum. Yet, in reality, these models are part of larger systems that include components for training data filtering, output monitoring, and more. In this work, we introduce privacy side channels: attacks that exploit these system-level components to extract private information at far higher rates…

    Submitted 18 July, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: USENIX Security 2024

  27. arXiv:2309.04858  [pdf, other]

    cs.LG cs.CL cs.CR

    Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System

    Authors: Daphne Ippolito, Nicholas Carlini, Katherine Lee, Milad Nasr, Yun William Yu

    Abstract: Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-$k$ or nucleus sampling). Our ability to discover which decoding strategy was used has implicati…

    Submitted 9 September, 2023; originally announced September 2023.

    Comments: 6 pages, 4 figures, 3 tables. Also, 5 page appendix. Accepted to INLG 2023

  28. Identifying and Mitigating the Security Risks of Generative AI

    Authors: Clark Barrett, Brad Boyd, Elie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, Diyi Yang

    Abstract: Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well…

    Submitted 28 December, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

    Journal ref: Foundations and Trends in Privacy and Security 6 (2023) 1-52

  29. arXiv:2307.15043  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Authors: Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

    Abstract: Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practic…

    Submitted 20 December, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Comments: Website: http://llm-attacks.org/

  30. arXiv:2307.15008  [pdf, other]

    cs.CR cs.AI cs.LG

    A LLM Assisted Exploitation of AI-Guardian

    Authors: Nicholas Carlini

    Abstract: Large language models (LLMs) are now highly capable at a diverse range of tasks. This paper studies whether or not GPT-4, one such LLM, is capable of assisting researchers in the field of adversarial machine learning. As a case study, we evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023, a top computer security conference. We completely bre…

    Submitted 20 July, 2023; originally announced July 2023.

  31. arXiv:2307.14692  [pdf, other]

    cs.CR

    Backdoor Attacks for In-Context Learning with Language Models

    Authors: Nikhil Kandpal, Matthew Jagielski, Florian Tramèr, Nicholas Carlini

    Abstract: Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor…

    Submitted 27 July, 2023; originally announced July 2023.

    Comments: AdvML Frontiers Workshop 2023

  32. arXiv:2307.06865  [pdf, other]

    cs.CL cs.AI

    Effective Prompt Extraction from Language Models

    Authors: Yiming Zhang, Nicholas Carlini, Daphne Ippolito

    Abstract: The text generated by large language models is commonly controlled by prompting, where a prompt prepended to a user's query guides the model's output. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold on marketplaces. However, anecdotal reports have shown ad…

    Submitted 7 August, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

  33. arXiv:2306.15447  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Are aligned neural networks adversarially aligned?

    Authors: Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt

    Abstract: Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models rema…

    Submitted 6 May, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

  34. arXiv:2306.02895  [pdf, other]

    cs.CR cs.LG stat.ML

    Evading Black-box Classifiers Without Breaking Eggs

    Authors: Edoardo Debenedetti, Nicholas Carlini, Florian Tramèr

    Abstract: Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware, harmful content, etc). Queries to such systems carry a fundamentally asy…

    Submitted 14 February, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

    Comments: Code at https://github.com/ethz-privsec/realistic-adv-examples. Accepted at IEEE SaTML 2024

  35. arXiv:2303.03446  [pdf, other]

    cs.CR cs.LG

    Students Parrot Their Teachers: Membership Inference on Model Distillation

    Authors: Matthew Jagielski, Milad Nasr, Christopher Choquette-Choo, Katherine Lee, Nicholas Carlini

    Abstract: Model distillation is frequently proposed as a technique to reduce the privacy leakage of machine learning. These empirical privacy defenses rely on the intuition that distilled "student" models protect the privacy of training data, as they only interact with this data indirectly through a "teacher" model. In this work, we design membership inference attacks to systematically study the privacy…

    Submitted 6 March, 2023; originally announced March 2023.

    Comments: 16 pages, 12 figures

  36. arXiv:2302.13464  [pdf, other]

    cs.LG cs.CR

    Randomness in ML Defenses Helps Persistent Attackers and Hinders Evaluators

    Authors: Keane Lucas, Matthew Jagielski, Florian Tramèr, Lujo Bauer, Nicholas Carlini

    Abstract: It is becoming increasingly imperative to design robust ML defenses. However, recent work has found that many defenses that initially resist state-of-the-art attacks can be broken by an adaptive adversary. In this work we take steps to simplify the design of defenses and argue that white-box defenses should eschew randomness when possible. We begin by illustrating a new issue with the deployment o…

    Submitted 26 February, 2023; originally announced February 2023.

  37. arXiv:2302.10149  [pdf, other]

    cs.CR cs.LG

    Poisoning Web-Scale Training Datasets is Practical

    Authors: Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr

    Abstract: Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet…

    Submitted 6 May, 2024; v1 submitted 20 February, 2023; originally announced February 2023.

  38. arXiv:2302.07956  [pdf, other]

    cs.LG cs.CR

    Tight Auditing of Differentially Private Machine Learning

    Authors: Milad Nasr, Jamie Hayes, Thomas Steinke, Borja Balle, Florian Tramèr, Matthew Jagielski, Nicholas Carlini, Andreas Terzis

    Abstract: Auditing mechanisms for differential privacy use probabilistic means to empirically estimate the privacy level of an algorithm. For private machine learning, existing auditing mechanisms are tight: the empirical privacy estimate (nearly) matches the algorithm's provable privacy guarantee. But these auditing techniques suffer from two limitations. First, they only give tight estimates under implaus…

    Submitted 15 February, 2023; originally announced February 2023.

  39. arXiv:2302.01381  [pdf, other]

    cs.LG cs.CV

    Effective Robustness against Natural Distribution Shifts for Models with Different Training Data

    Authors: Zhouxing Shi, Nicholas Carlini, Ananth Balashankar, Ludwig Schmidt, Cho-Jui Hsieh, Alex Beutel, Yao Qin

    Abstract: "Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageN…

    Submitted 28 October, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

    Comments: NeurIPS 2023

  40. arXiv:2301.13188  [pdf, other]

    cs.CR cs.CV cs.LG

    Extracting Training Data from Diffusion Models

    Authors: Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace

    Abstract: Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the…

    Submitted 30 January, 2023; originally announced January 2023.

  41. arXiv:2212.13700  [pdf, other]

    cs.CR cs.LG

    Publishing Efficient On-device Models Increases Adversarial Vulnerability

    Authors: Sanghyun Hong, Nicholas Carlini, Alexey Kurakin

    Abstract: Recent increases in the computational demands of deep neural networks (DNNs) have sparked interest in efficient deep learning mechanisms, e.g., quantization or pruning. These mechanisms enable the construction of a small, efficient version of commercial-scale models with comparable accuracy, accelerating their deployment to resource-constrained devices. In this paper, we study the security consi…

    Submitted 28 December, 2022; originally announced December 2022.

    Comments: Accepted to IEEE SaTML 2023

  42. arXiv:2212.06470  [pdf, ps, other]

    cs.LG cs.CR stat.ML

    Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

    Authors: Florian Tramèr, Gautam Kamath, Nicholas Carlini

    Abstract: The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets. We critically review this approach. We primarily question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pret…

    Submitted 17 July, 2024; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: Full and unabridged version of the ICML 2024 paper

  43. arXiv:2210.17546  [pdf, other]

    cs.LG cs.CL

    Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

    Authors: Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A. Choquette-Choo, Nicholas Carlini

    Abstract: Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. We argu…

    Submitted 11 September, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

  44. arXiv:2210.03297  [pdf, other]

    cs.CR cs.CV cs.LG

    Preprocessors Matter! Realistic Decision-Based Attacks on Machine Learning Systems

    Authors: Chawin Sitawarin, Florian Tramèr, Nicholas Carlini

    Abstract: Decision-based attacks construct adversarial examples against a machine learning (ML) model by making only hard-label queries. These attacks have mainly been applied directly to standalone neural networks. However, in practice, ML models are just one component of a larger learning system. We find that by adding a single preprocessor in front of a classifier, state-of-the-art query-based attacks ar…

    Submitted 20 July, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: ICML 2023. Code can be found at https://github.com/google-research/preprocessor-aware-black-box-attack

  45. arXiv:2209.14987  [pdf, other]

    cs.LG cs.CR

    No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy"

    Authors: Nicholas Carlini, Vitaly Feldman, Milad Nasr

    Abstract: New methods designed to preserve data privacy require careful scrutiny. Failure to preserve privacy is hard to detect, and yet can lead to catastrophic results when a system implementing a "privacy-preserving" method is attacked. A recent work selected for an Outstanding Paper Award at ICML 2022 (Dong et al., 2022) claims that dataset condensation (DC) significantly improves data privacy when tr…

    Submitted 29 September, 2022; originally announced September 2022.

  46. arXiv:2209.09117  [pdf, other]

    cs.CV cs.CR cs.LG

    Part-Based Models Improve Adversarial Robustness

    Authors: Chawin Sitawarin, Kornrapat Pongmala, Yizheng Chen, Nicholas Carlini, David Wagner

    Abstract: We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification. We believe that the richer form of annotation helps guide neural networks to learn more robust features without requiring more samples or larger models. Our model combines a part segmentation model with a tiny classifi…

    Submitted 8 March, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

    Comments: Published in ICLR 2023 (poster). Code can be found at https://github.com/chawins/adv-part-model

  47. arXiv:2207.00099  [pdf, other]

    cs.LG

    Measuring Forgetting of Memorized Training Examples

    Authors: Matthew Jagielski, Om Thakkar, Florian Tramèr, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, Chiyuan Zhang

    Abstract: Machine learning models exhibit two seemingly contradictory phenomena: training data memorization, and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure to what…

    Submitted 9 May, 2023; v1 submitted 30 June, 2022; originally announced July 2022.

    Comments: Appeared at ICLR '23, 22 pages, 12 figures

  48. arXiv:2206.13991  [pdf, other]

    cs.LG cs.CR cs.CV

    Increasing Confidence in Adversarial Robustness Evaluations

    Authors: Roland S. Zimmermann, Wieland Brendel, Florian Tramèr, Nicholas Carlini

    Abstract: Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of these defenses held up their claims because correctly evaluating robustness is extremely challenging: Weak attacks often fail to find adversarial examples even if they unknowingly exist, thereby making a vulnerable network look robust. In this pa…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: Oral at CVPR 2022 Workshop (Art of Robustness). Project website https://zimmerrol.github.io/active-tests/

  49. arXiv:2206.10550  [pdf, other]

    cs.LG cs.CR

    (Certified!!) Adversarial Robustness for Free!

    Authors: Nicholas Carlini, Florian Tramèr, Krishnamurthy Dj Dvijotham, Leslie Rice, Mingjie Sun, J. Zico Kolter

    Abstract: In this paper we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. 2020 by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accura…

    Submitted 6 March, 2023; v1 submitted 21 June, 2022; originally announced June 2022.

  50. arXiv:2206.10469  [pdf, other]

    cs.LG cs.CR

    The Privacy Onion Effect: Memorization is Relative

    Authors: Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, Florian Tramèr

    Abstract: Machine learning models trained on private datasets have been shown to leak their private data. While recent work has found that the average data point is rarely leaked, the outlier samples are frequently subject to memorization and, consequently, privacy leakage. We demonstrate and analyse an Onion Effect of memorization: removing the "layer" of outlier points that are most vulnerable to a privac…

    Submitted 22 June, 2022; v1 submitted 21 June, 2022; originally announced June 2022.