Showing 1–2 of 2 results for author: Ackerman, C M

  1. arXiv:2504.09604  [pdf, other]

    cs.LG cs.CR

    Mitigating Many-Shot Jailbreaking

    Authors: Christopher M. Ackerman, Nina Panickssery

    Abstract: Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a "fake" assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the "fake" as…

    Submitted 15 May, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

  2. arXiv:2409.06927  [pdf, other]

    cs.LG cs.CL

    Representation Tuning

    Authors: Christopher M. Ackerman

    Abstract: Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, we extend the idea of inference-time steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, we identify activation vectors related to honesty in an open-sou…

    Submitted 24 November, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

    Comments: 10 pages, 7 figures, 6 tables