Showing 1–6 of 6 results for author: Greenblatt, R

  1. arXiv:2412.14093  [pdf, other]

    cs.AI cs.CL cs.LG

    Alignment faking in large language models

    Authors: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger

    Abstract: We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model…

    Submitted 19 December, 2024; v1 submitted 18 December, 2024; originally announced December 2024.

  2. arXiv:2405.19550  [pdf, other]

    cs.LG cs.CL

    Stress-Testing Capability Elicitation With Password-Locked Models

    Authors: Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger

    Abstract: To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elici…

    Submitted 29 May, 2024; originally announced May 2024.

  3. arXiv:2401.05566  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG cs.SE

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, et al. (14 additional authors not shown)

    Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa…

    Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: updated to add missing acknowledgements

  4. arXiv:2312.06942  [pdf, other]

    cs.LG

    AI Control: Improving Safety Despite Intentional Subversion

    Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

    Abstract: As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evalu…

    Submitted 23 July, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Edit: Fix minor typos, clarify abstract, add glossary, expand related work. ICML version: https://openreview.net/pdf?id=KviM5k8pcP

  5. arXiv:2310.18512  [pdf, other]

    cs.LG

    Preventing Language Models From Hiding Their Reasoning

    Authors: Fabien Roger, Ryan Greenblatt

    Abstract: Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems. When these intermediate steps of reasoning are used to monitor the activity of the model, it is essential that this explicit reasoning is faithful, i.e. that it reflects what the model is actually reasoning about. In this work, we focus on one potential way intermediate steps of…

    Submitted 31 October, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: Edit: Fix typos

  6. arXiv:2308.15605  [pdf, other]

    cs.LG

    Benchmarks for Detecting Measurement Tampering

    Authors: Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

    Abstract: When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization. One concern is measurement tampering, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measuremen…

    Submitted 29 September, 2023; v1 submitted 29 August, 2023; originally announced August 2023.

    Comments: Edits: extended and improved appendices, fixed references, figures, and typos