Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability

Chhabra, Vishnu Kabir; Khalili, Mohammad Mahdi

arXiv:2504.04215 (cs)

[Submitted on 5 Apr 2025]

Abstract: The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. While extensive research has explored model compression through the lens of safety, findings suggest that safety-aligned models often lose elements of trustworthiness post-compression. Simultaneously, the field of mechanistic interpretability has gained traction, with notable discoveries, such as the identification of a single direction in the residual stream mediating refusal behaviors across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting a novel interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.
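
The abstract refers to a single residual-stream direction that mediates refusal but does not say how such a direction is obtained. As an illustration only, the sketch below uses the difference-of-means estimate that is common in the interpretability literature: average the activations on harmful and on harmless prompts, subtract, and normalize. It then removes that direction from activations by projection. This is not presented as the authors' method. The activations are synthetic Gaussian vectors with a planted direction, and every name here (`refusal_direction`, `ablate`, `D_MODEL`, `planted`) is hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means estimate: mean(harmful) - mean(harmless), normalized."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])


def ablate(act, direction):
    """Project out the component of `act` along the unit vector `direction`."""
    coeff = dot(act, direction)
    return [a - coeff * d for a, d in zip(act, direction)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


# Synthetic stand-in for residual-stream activations (assumption: in practice
# these would be read from a model at one layer and token position).
random.seed(0)
D_MODEL = 16
planted = [1.0] + [0.0] * (D_MODEL - 1)  # hypothetical "true" direction


def sample(shift):
    """One Gaussian activation, shifted along the planted direction."""
    return [random.gauss(0.0, 1.0) + shift * p for p in planted]


harmless = [sample(0.0) for _ in range(200)]
harmful = [sample(4.0) for _ in range(200)]

r_hat = refusal_direction(harmful, harmless)
ablated = [ablate(a, r_hat) for a in harmful]

print("cosine(r_hat, planted) =", round(cosine(r_hat, planted), 3))
```

The same `cosine` helper could, under the same assumptions, compare a direction estimated from an uncompressed model with one estimated from its compressed counterpart. Whether the paper performs that comparison is not stated in the abstract.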

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2504.04215 [cs.CL]
	(or arXiv:2504.04215v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2504.04215

Submission history

From: Mohammad Mahdi Khalili
[v1] Sat, 5 Apr 2025 16:00:44 UTC (673 KB)
