Programming Refusal with Conditional Activation Steering

Lee, Bruce W.; Padhi, Inkit; Ramamurthy, Karthikeyan Natesan; Miehling, Erik; Dognin, Pierre; Nagireddy, Manish; Dhurandhar, Amit

arXiv:2409.05907 (cs)

[Submitted on 6 Sep 2024 (v1), last revised 17 Feb 2025 (this version, v3)]

Abstract: LLMs have shown remarkable capabilities, but precisely controlling their response behavior remains challenging. Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants. In this paper, we propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context. Our method is based on the observation that different categories of prompts activate distinct patterns in the model's hidden states. Using CAST, one can systematically control LLM behavior with rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse." This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization. We release an open-source implementation of our framework at this http URL.
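
The rule form described in the abstract ("if input is about X, then refuse") can be illustrated with a toy sketch. This is not the authors' released implementation: the function names, the use of plain cosine similarity against a single condition vector, the threshold, and the three-dimensional vectors are all assumptions made purely for illustration. The idea shown is only that a steering (behavior) vector is added to a hidden state when a condition check on that hidden state fires, and withheld otherwise.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def conditional_steer(hidden, condition_vec, behavior_vec, threshold, alpha=1.0):
    """Hypothetical helper: steer only when the condition check fires.

    hidden        -- toy stand-in for a model hidden state
    condition_vec -- assumed direction associated with a prompt category
    behavior_vec  -- assumed steering direction (e.g. toward refusal)
    threshold     -- assumed similarity cutoff for the condition
    alpha         -- steering strength

    Returns (new_hidden, triggered).
    """
    triggered = cosine(hidden, condition_vec) > threshold
    if not triggered:
        # Condition not met: leave the hidden state untouched.
        return list(hidden), False
    # Condition met: add the scaled behavior vector.
    return [h + alpha * v for h, v in zip(hidden, behavior_vec)], True


if __name__ == "__main__":
    condition = [1.0, 0.0, 0.0]   # made-up "category" direction
    refusal = [0.0, 0.0, 2.0]     # made-up "refuse" direction
    for name, h in [("category-like", [0.9, 0.1, 0.0]),
                    ("other", [0.1, 0.9, 0.0])]:
        steered, fired = conditional_steer(h, condition, refusal, threshold=0.5)
        print(name, fired, steered)
```

In this sketch the weights of the model are never modified; only the hidden state is shifted, and only for inputs whose hidden state aligns with the condition direction. A negated rule such as "if input is not about legal advice, then refuse" would correspond, under the same assumptions, to flipping the comparison.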

Comments:	ICLR 2025, Spotlight
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2409.05907 [cs.LG]
	(or arXiv:2409.05907v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2409.05907

Submission history

From: Bruce W. Lee
[v1] Fri, 6 Sep 2024 15:47:40 UTC (3,820 KB)
[v2] Tue, 11 Feb 2025 16:22:45 UTC (3,842 KB)
[v3] Mon, 17 Feb 2025 20:23:19 UTC (3,842 KB)
