Showing 1–1 of 1 results for author: Kutasov, J

  1. arXiv:2506.15740 [pdf, ps, other]

    cs.AI cs.CR cs.LG

    SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents

    Authors: Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton

    Abstract: As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHA…

    Submitted 17 June, 2025; originally announced June 2025.