
Showing 1–2 of 2 results for author: Whittingham, H

  1. arXiv:2506.01926  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    Large language models can learn and generalize steganographic chain-of-thought under process supervision

    Authors: Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiv, Hannes Whittingham, Lorena Gonzalez-Manzano, David Lindner, Cameron Tice, Edward James Young, Puria Radmard

    Abstract: Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. By proactively preventing models from acting on CoT indicating misaligned or harmful intent, CoT monitoring can be used to reduce risks associated with deploying models. However,…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: 10 pages main text, 3 figures main text, 15 pages supplementary material, 1 figure supplementary material, submitted to NeurIPS 2025

  2. arXiv:2505.23575  [pdf, ps, other]

    cs.AI cs.LG

    CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

    Authors: Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong

    Abstract: As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only fin…

    Submitted 29 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.