
Showing 1–1 of 1 results for author: Ameisen, E

  1. arXiv:2503.10965  [pdf, other]

    cs.AI cs.CL cs.LG

    Auditing language models for hidden objectives

    Authors: Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, et al. (10 additional authors not shown)

    Abstract: We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model…

    Submitted 27 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.