Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks

Zhao, Jiawei; Chen, Kejiang; Yuan, Xiaojian; Zhang, Weiming

arXiv:2408.08924 (cs)

[Submitted on 15 Aug 2024 (v1), last revised 22 Aug 2024 (this version, v2)]

Abstract: In recent years, large language models (LLMs) have developed rapidly and achieved remarkable performance across various tasks. However, research indicates that LLMs are vulnerable to jailbreak attacks, in which adversaries induce the generation of harmful content through meticulously crafted prompts. This vulnerability poses significant challenges to the secure use and promotion of LLMs. Existing defense methods offer protection from different perspectives but often suffer from insufficient effectiveness or significantly degrade the model's capabilities. In this paper, we propose a plug-and-play, easy-to-deploy jailbreak defense framework, Prefix Guidance (PG), which guides the model to identify harmful prompts by directly setting the first few tokens of the model's output. This approach combines the model's inherent security capabilities with an external classifier to defend against jailbreak attacks. We demonstrate the effectiveness of PG across three models and five attack methods. Compared to baselines, our approach is, on average, more effective. Additionally, results on the Just-Eval benchmark further confirm PG's superiority in preserving the model's performance. Our code is available at this https URL.
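
The idea described in the abstract (fix the first output tokens, then let an external classifier judge the result) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the prefix string, the function names `prefix_guided_reply`, `toy_generate`, and `toy_flags_harmful`, the `[harmful]` marker, and the three-step control flow are all assumptions made for this example, and the toy functions stand in for a real LLM and a real classifier.

```python
from typing import Callable

# Hypothetical guidance prefix; the abstract does not state the actual tokens.
GUIDANCE_PREFIX = "I"


def prefix_guided_reply(
    prompt: str,
    generate: Callable[[str, str], str],
    flags_harmful: Callable[[str], bool],
    prefix: str = GUIDANCE_PREFIX,
) -> str:
    """Assumed flow: force a prefix, classify the guided output, then decide.

    generate(prompt, forced_prefix) must return the text the model produces
    after forced_prefix (an empty prefix means unconstrained generation).
    """
    # Step 1: set the first tokens of the output and let the model continue.
    guided = prefix + generate(prompt, prefix)
    # Step 2: an external classifier inspects the guided output.
    if flags_harmful(guided):
        # The model explained why it will not comply; return that text.
        return guided
    # Step 3: the prompt looks benign, so answer without the forced prefix.
    return generate(prompt, "")


def toy_generate(prompt: str, forced_prefix: str) -> str:
    """Stand-in for an LLM; a real system would call a model here."""
    if forced_prefix:
        if "[harmful]" in prompt:
            return " cannot help with that request."
        return " am happy to help with that."
    return f"Here is an answer to: {prompt}"


def toy_flags_harmful(text: str) -> bool:
    """Stand-in for the external classifier (a keyword check, for illustration)."""
    return "cannot help" in text
```

With these stand-ins, a prompt containing the `[harmful]` marker yields the guided refusal text, while any other prompt falls through to an ordinary, unconstrained answer. Keeping the generator and the classifier as injected callables reflects the plug-and-play framing of the abstract, though how the actual framework wires them together is not specified there.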

Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2408.08924 [cs.CR]
	(or arXiv:2408.08924v2 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2408.08924

Submission history

From: Jiawei Zhao
[v1] Thu, 15 Aug 2024 14:51:32 UTC (3,336 KB)
[v2] Thu, 22 Aug 2024 17:21:34 UTC (3,336 KB)