
Showing 1–1 of 1 results for author: Arnav, B

  1. arXiv:2505.23575  [pdf, ps, other]

    cs.AI cs.LG

    CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

    Authors: Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong

    Abstract: As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only fin…

    Submitted 29 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.