Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Jain, Neel; Schwarzschild, Avi; Wen, Yuxin; Somepalli, Gowthami; Kirchenbauer, John; Chiang, Ping-yeh; Goldblum, Micah; Saha, Aniruddha; Geiping, Jonas; Goldstein, Tom


arXiv:2309.00614 (cs)

[Submitted on 1 Sep 2023 (v1), last revised 4 Sep 2023 (this version, v2)]

Abstract: As large language models (LLMs) quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision?
We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the settings in which each is feasible and effective. In particular, we look at three types of defenses: detection (perplexity-based), input preprocessing (paraphrase and retokenization), and adversarial training. We consider white-box and gray-box settings and discuss the robustness-performance trade-off for each defense. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether filtering and preprocessing defenses are stronger in the LLM domain than they have been in computer vision.
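
To make the perplexity-based detection idea concrete, here is a minimal sketch. It is not the authors' implementation: it swaps the LLM for a toy add-one-smoothed unigram model over whitespace tokens, and the names `fit_unigram`, `perplexity`, and `is_flagged`, the reference prompts, and the threshold rule (the maximum perplexity over the benign reference prompts) are all assumptions made for illustration. The only point carried over from the abstract is that a prompt is flagged when its perplexity is unusually high.

```python
import math
from collections import Counter


def fit_unigram(texts):
    """Fit a toy add-one-smoothed unigram model (stand-in for an LLM)."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def log_prob(token):
        return math.log((counts.get(token, 0) + 1) / (total + vocab))

    return log_prob


def perplexity(text, log_prob):
    """exp of the average negative log-likelihood per token."""
    tokens = text.lower().split()
    if not tokens:
        # Assumption: an empty prompt has no defined perplexity; treat as infinite.
        return float("inf")
    nll = -sum(log_prob(tok) for tok in tokens) / len(tokens)
    return math.exp(nll)


def is_flagged(text, log_prob, threshold):
    """Flag a prompt whose perplexity exceeds the threshold."""
    return perplexity(text, log_prob) > threshold


# Hypothetical benign reference prompts used to fit the toy model.
reference = [
    "please write a short story about a dog",
    "explain how a neural network learns",
    "write a poem about the sea",
]
log_prob = fit_unigram(reference)

# Assumed threshold rule: the highest perplexity seen on the benign prompts.
threshold = max(perplexity(t, log_prob) for t in reference)

benign = "write a short story about the sea"
suspicious = "write a story zx!! describing. + similarlyNow qq]("

benign_flagged = is_flagged(benign, log_prob, threshold)
suspicious_flagged = is_flagged(suspicious, log_prob, threshold)
```

In this toy setup the prompt with the gibberish suffix scores a higher perplexity than the fluent one, which is the signal such a filter relies on. A realistic version would compute token log-likelihoods with an actual language model and would have to weigh false positives on unusual but benign prompts.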

Comments:	12 pages
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as:	arXiv:2309.00614 [cs.LG]
	(or arXiv:2309.00614v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2309.00614

Submission history

From: Jonas Geiping
[v1] Fri, 1 Sep 2023 17:59:44 UTC (336 KB)
[v2] Mon, 4 Sep 2023 17:47:36 UTC (281 KB)
