Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

Jang, Yeonwoo; Hossain, Shariqah; Sreevatsa, Ashwin; Cruz, Diogo

arXiv:2506.10236v1 (cs)

[Submitted on 11 Jun 2025 (this version), latest version 14 Aug 2025 (v2)]

Abstract: In this work, we show that some machine unlearning methods may fail when subjected to straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families, using output-based, logit-based, and probe analyses to determine to what extent supposedly unlearned knowledge can be retrieved. While methods such as RMU and TAR demonstrate robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., adding Hindi filler text to the original prompt recovers 57.3% accuracy). Our logit analysis also confirms that unlearned models are generally not hiding knowledge by changing how the answer is formatted, as output accuracy and logit accuracy are strongly correlated. These results challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish true knowledge removal from superficial output suppression. We also publicly release our evaluation framework, which makes it easy to evaluate prompting techniques for retrieving unlearned knowledge.
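
To make the distinction between output-based and logit-based accuracy concrete, the following is a minimal sketch, not the authors' released framework. It assumes a four-option multiple-choice setting with answer letters A to D; every name in it (`output_correct`, `logit_correct`, `add_filler`, `pearson`) is hypothetical, and the answer parsing is deliberately crude. Output-based scoring reads the answer from the generated text, logit-based scoring takes the highest-scoring answer letter regardless of formatting, and a Pearson correlation across attack configurations indicates whether the two measures agree.

```python
import re
from math import sqrt

CHOICES = "ABCD"  # assumption: four-option multiple-choice questions


def output_correct(generated: str, gold: str) -> bool:
    """Output-based check: first standalone uppercase A-D letter in the text.

    Crude on purpose: a leading article "A" would be misread as an answer.
    """
    match = re.search(r"\b([ABCD])\b", generated)
    return match is not None and match.group(1) == gold


def logit_correct(choice_logits: dict, gold: str) -> bool:
    """Logit-based check: answer letter with the highest logit."""
    return max(CHOICES, key=lambda c: choice_logits[c]) == gold


def accuracy(flags: list) -> float:
    """Fraction of correct answers."""
    return sum(flags) / len(flags)


def add_filler(prompt: str, filler: str) -> str:
    """Hypothetical prompt attack: prepend unrelated filler text."""
    return f"{filler}\n\n{prompt}"


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation between two equal-length lists of accuracies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In use, one would compute `accuracy` under both checks for each prompt variant (for example, prompts wrapped with `add_filler`) and pass the two lists of accuracies to `pearson`. A high correlation would suggest the model is not merely reformatting its answers to hide retained knowledge.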

Comments:	20 pages, 6 figures
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Cite as:	arXiv:2506.10236 [cs.CR]
	(or arXiv:2506.10236v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2506.10236

Submission history

From: Diogo Cruz
[v1] Wed, 11 Jun 2025 23:36:30 UTC (446 KB)
[v2] Thu, 14 Aug 2025 05:03:53 UTC (456 KB)
