Showing 1–4 of 4 results for author: Sztyber-Betley, A

  1. arXiv:2502.17424

    cs.CL cs.AI cs.CR cs.LG

    Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

    Authors: Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans

    Abstract: We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure c…

    Submitted 12 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: 40 pages, 38 figures. An earlier revision of this paper was accepted at ICML 2025. Since then, it has been updated to include new results on training dynamics (4.7) and base models (4.8).

  2. arXiv:2501.14249

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  3. arXiv:2501.11120

    cs.CL cs.AI cs.CR cs.LG

    Tell me about yourself: LLMs are aware of their learned behaviors

    Authors: Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans

    Abstract: We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it.…

    Submitted 19 January, 2025; originally announced January 2025.

    Comments: Submitted to ICLR 2025. 17 pages, 13 figures

  4. BridgeHand2Vec Bridge Hand Representation

    Authors: Anna Sztyber-Betley, Filip Kołodziej, Jan Betley, Piotr Duszak

    Abstract: Contract bridge is a game characterized by incomplete information, posing an exciting challenge for artificial intelligence methods. This paper proposes the BridgeHand2Vec approach, which leverages a neural network to embed a bridge player's hand (consisting of 13 cards) into a vector space. The resulting representation reflects the strength of the hand in the game and enables interpretable distan…

    Submitted 10 October, 2023; originally announced October 2023.

    Journal ref: Frontiers in Artificial Intelligence and Applications, Volume 372: ECAI 2023, Pages 2274-2281