Showing 1–4 of 4 results for author: Papadatos, H

  1. arXiv:2504.11844

    cs.AI cs.CL cs.LG

    Evaluating the Goal-Directedness of Large Language Models

    Authors: Tom Everitt, Cristina Garbacea, Alexis Bellot, Jonathan Richens, Henry Papadatos, Siméon Campos, Rohin Shah

    Abstract: To what extent do LLMs use their capabilities towards their given goal? We take this as a measure of their goal-directedness. We evaluate goal-directedness on tasks that require information gathering, cognitive effort, and plan execution, where we use subtasks to infer each model's relevant capabilities. Our evaluations of LLMs from Google DeepMind, OpenAI, and Anthropic show that goal-directednes…

    Submitted 16 April, 2025; originally announced April 2025.

  2. arXiv:2503.04299

    cs.AI

    Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation

    Authors: Malcolm Murray, Henry Papadatos, Otter Quarks, Pierre-François Gimenez, Simeon Campos

    Abstract: The literature and multiple experts point to many potential risks from large language models (LLMs), but there are still very few direct measurements of the actual harms posed. AI risk assessment has so far focused on measuring the models' capabilities, but the capabilities of models are only indicators of risk, not measures of risk. Better modeling and quantification of AI risk scenarios can help…

    Submitted 10 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: 23 pages, 4 figures

  3. arXiv:2502.06656

    cs.AI

    A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management

    Authors: Simeon Campos, Henry Papadatos, Fabien Roger, Chloé Touzet, Otter Quarks, Malcolm Murray

    Abstract: The recent development of powerful AI systems has highlighted the need for robust risk management frameworks in the AI industry. Although companies have begun to implement safety frameworks, current approaches often lack the systematic rigor found in other high-risk industries. This paper presents a comprehensive risk management framework for the development of frontier AI that bridges this gap by…

    Submitted 19 February, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

  4. arXiv:2412.00967

    cs.AI

    Linear Probe Penalties Reduce LLM Sycophancy

    Authors: Henry Papadatos, Rachel Freedman

    Abstract: Large language models (LLMs) are often sycophantic, prioritizing agreement with their users over accurate or objective statements. This problematic behavior becomes more pronounced during reinforcement learning from human feedback (RLHF), an LLM fine-tuning stage intended to align model outputs with human values. Instead of increasing accuracy and reliability, the reward model learned from RLHF of…

    Submitted 1 December, 2024; originally announced December 2024.

    Comments: 20 pages, 15 figures, NeurIPS 2024 Workshop on Socially Responsible Language Modelling Research (SoLaR)