Showing 1–2 of 2 results for author: Ackerman, C M
-
Mitigating Many-Shot Jailbreaking
Authors:
Christopher M. Ackerman,
Nina Panickssery
Abstract:
Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a "fake" assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the "fake" assistant. In this work, we probe the effectiveness of different fine-tuning and input sanitization approaches in mitigating MSJ attacks, alone and in combination. We find incremental mitigation effectiveness for each, and show that the combined techniques significantly reduce the effectiveness of MSJ attacks, while retaining model performance in benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.
Submitted 15 May, 2025; v1 submitted 13 April, 2025;
originally announced April 2025.
-
Representation Tuning
Authors:
Christopher M. Ackerman
Abstract:
Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, we extend the idea of inference-time steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, we identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, we demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, we show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss ("representation tuning"). Finally, we compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at https://github.com/cma1114/representation_tuning. Tuned models are available at https://huggingface.co/collections/cackerman/representation-tuning-66da1e5ab41cd1b824687d9f.
Submitted 24 November, 2024; v1 submitted 10 September, 2024;
originally announced September 2024.
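The Representation Tuning abstract describes two mechanisms: online steering, which adds a multiple of a behavioral vector to residual stream activations, and a dual loss that combines a cosine-similarity term with a standard token-based loss. The following is a minimal sketch of both ideas in plain Python, not the authors' implementation. The function names, the `weight` and `sign` parameters, and the exact form of the combined loss (token loss plus a weighted mean of `1 - sign * cos`) are assumptions made for illustration; the authors' linked repository is the reference for the actual method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def steer(activation, direction, multiplier):
    """Online steering: add a multiple of the direction to one activation.

    A positive multiplier pushes the activation toward the direction,
    a negative multiplier pushes it away.
    """
    return [h + multiplier * v for h, v in zip(activation, direction)]


def dual_loss(activations, direction, token_loss, weight=1.0, sign=1.0):
    """Assumed combined objective: token loss plus a cosine-similarity term.

    activations: list of activation vectors (e.g. one per token position)
    direction:   the behavioral vector to tune toward (sign=1.0) or away
                 from (sign=-1.0)
    token_loss:  a precomputed standard token-based loss value
    weight:      hypothetical coefficient balancing the two terms
    """
    sims = [cosine_similarity(h, direction) for h in activations]
    representation_loss = sum(1.0 - sign * s for s in sims) / len(sims)
    return token_loss + weight * representation_loss


if __name__ == "__main__":
    direction = [1.0, 0.0]
    activation = [0.0, 1.0]  # orthogonal to the direction
    steered = steer(activation, direction, 2.0)
    print("similarity before:", cosine_similarity(activation, direction))
    print("similarity after: ", cosine_similarity(steered, direction))
    print("dual loss:", dual_loss([activation, steered], direction, token_loss=0.5))
```

In this sketch the representation term reaches zero only when every activation is perfectly aligned with the direction (or anti-aligned, when `sign=-1.0`), so minimizing it alongside the token loss moves the model's activations toward the chosen direction without a steering step at inference time.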