The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models

Xu, Zhiyuan; Gardiner, Joseph; Belguith, Sana

Computer Science > Cryptography and Security

arXiv:2502.01225 (cs)

[Submitted on 3 Feb 2025]

Abstract: Large language models are typically trained on vast amounts of data during the pre-training phase, which may include some potentially harmful information. Fine-tuning attacks can exploit this by prompting the model to reveal such behaviours, leading to the generation of harmful content. In this paper, we focus on investigating the performance of the Chain of Thought based reasoning model, DeepSeek, when subjected to fine-tuning attacks. Specifically, we explore how fine-tuning manipulates the model's output, exacerbating the harmfulness of its responses while examining the interaction between the Chain of Thought reasoning and adversarial inputs. Through this study, we aim to shed light on the vulnerability of Chain of Thought enabled models to fine-tuning attacks and the implications for their safety and ethical deployment.

Comments:	12 Pages
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.01225 [cs.CR]
	(or arXiv:2502.01225v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2502.01225

Submission history

From: Joseph Gardiner
[v1] Mon, 3 Feb 2025 10:28:26 UTC (357 KB)
