
Showing 1–1 of 1 results for author: Labenz, N

Searching in archive cs.
  1. arXiv:2502.17424  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

    Authors: Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans

    Abstract: We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure c…

    Submitted 12 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: 40 pages, 38 figures. An earlier revision of this paper was accepted at ICML 2025. Since then, it has been updated to include new results on training dynamics (4.7) and base models (4.8).