Showing 1–6 of 6 results for author: Betley, J

  1. arXiv:2506.13206  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

    Authors: James Chua, Jan Betley, Mia Taylor, Owain Evans

    Abstract: Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned -- a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models. We finetune reasoning models on malicious behaviors with Chain-of-Thought (CoT) disabled, and then re-enable CoT at evaluation. Like co…

    Submitted 16 June, 2025; originally announced June 2025.

  2. arXiv:2502.17424  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

    Authors: Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans

    Abstract: We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure c…

    Submitted 12 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: 40 pages, 38 figures. An earlier revision of this paper was accepted at ICML 2025. Since then, it has been updated to include new results on training dynamics (4.7) and base models (4.8).

  3. arXiv:2501.11120  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Tell me about yourself: LLMs are aware of their learned behaviors

    Authors: Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans

    Abstract: We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it.…

    Submitted 19 January, 2025; originally announced January 2025.

    Comments: Submitted to ICLR 2025. 17 pages, 13 figures

  4. arXiv:2407.04694  [pdf, other]

    cs.CL cs.AI cs.LG

    Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

    Authors: Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans

    Abstract: AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model". This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model's knowledge of itself and its circumstances as situational awareness. To quantify situational…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: 11-page main body, 98-page appendix, 58 figures

  5. arXiv:2406.14546  [pdf, other]

    cs.CL cs.AI cs.LG

    Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

    Authors: Johannes Treutlein, Dami Choi, Jan Betley, Samuel Marks, Cem Anil, Roger Grosse, Owain Evans

    Abstract: One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-…

    Submitted 23 December, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: Accepted at NeurIPS 2024. 10 pages, 8 figures

  6. BridgeHand2Vec Bridge Hand Representation

    Authors: Anna Sztyber-Betley, Filip Kołodziej, Jan Betley, Piotr Duszak

    Abstract: Contract bridge is a game characterized by incomplete information, posing an exciting challenge for artificial intelligence methods. This paper proposes the BridgeHand2Vec approach, which leverages a neural network to embed a bridge player's hand (consisting of 13 cards) into a vector space. The resulting representation reflects the strength of the hand in the game and enables interpretable distan…

    Submitted 10 October, 2023; originally announced October 2023.

    Journal ref: Frontiers in Artificial Intelligence and Applications, Volume 372: ECAI 2023, Pages 2274 - 2281