-
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Authors:
Bowen Baker,
Joost Huizinga,
Leo Gao,
Zehao Dou,
Melody Y. Guan,
Aleksander Madry,
Wojciech Zaremba,
Jakub Pachocki,
David Farhi
Abstract:
Mitigating reward hacking--where AI systems misbehave due to flaws or misspecifications in their learning objectives--remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model's chain-of-thought (CoT) reasoning. CoT…
▽ More
Mitigating reward hacking--where AI systems misbehave due to flaws or misspecifications in their learning objectives--remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model's chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent's training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
OpenAI o1 System Card
Authors:
OpenAI,
:,
Aaron Jaech,
Adam Kalai,
Adam Lerer,
Adam Richardson,
Ahmed El-Kishky,
Aiden Low,
Alec Helyar,
Aleksander Madry,
Alex Beutel,
Alex Carney,
Alex Iftimie,
Alex Karpenko,
Alex Tachard Passos,
Alexander Neitz,
Alexander Prokofiev,
Alexander Wei,
Allison Tam,
Ally Bennett,
Ananya Kumar,
Andre Saraiva,
Andrea Vallone,
Andrew Duberstein,
Andrew Kondrich
, et al. (238 additional authors not shown)
Abstract:
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-ar…
▽ More
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
△ Less
Submitted 21 December, 2024;
originally announced December 2024.
-
GPT-4o System Card
Authors:
OpenAI,
:,
Aaron Hurst,
Adam Lerer,
Adam P. Goucher,
Adam Perelman,
Aditya Ramesh,
Aidan Clark,
AJ Ostrow,
Akila Welihinda,
Alan Hayes,
Alec Radford,
Aleksander Mądry,
Alex Baker-Whitcomb,
Alex Beutel,
Alex Borzunov,
Alex Carney,
Alex Chow,
Alex Kirillov,
Alex Nichol,
Alex Paino,
Alex Renzin,
Alex Tachard Passos,
Alexander Kirillov,
Alexi Christakis
, et al. (395 additional authors not shown)
Abstract:
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 mil…
▽ More
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
△ Less
Submitted 25 October, 2024;
originally announced October 2024.
-
GPT-4 Technical Report
Authors:
OpenAI,
Josh Achiam,
Steven Adler,
Sandhini Agarwal,
Lama Ahmad,
Ilge Akkaya,
Florencia Leoni Aleman,
Diogo Almeida,
Janko Altenschmidt,
Sam Altman,
Shyamal Anadkat,
Red Avila,
Igor Babuschkin,
Suchir Balaji,
Valerie Balcom,
Paul Baltescu,
Haiming Bao,
Mohammad Bavarian,
Jeff Belgum,
Irwan Bello,
Jake Berdine,
Gabriel Bernadett-Shapiro,
Christopher Berner,
Lenny Bogdonoff,
Oleg Boiko
, et al. (256 additional authors not shown)
Abstract:
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based mo…
▽ More
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
△ Less
Submitted 4 March, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Authors:
Greg Yang,
Edward J. Hu,
Igor Babuschkin,
Szymon Sidor,
Xiaodong Liu,
David Farhi,
Nick Ryder,
Jakub Pachocki,
Weizhu Chen,
Jianfeng Gao
Abstract:
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HP indirectly on…
▽ More
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HP indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify muTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost. A Pytorch implementation of our technique can be found at github.com/microsoft/mup and installable via `pip install mup`.
△ Less
Submitted 28 March, 2022; v1 submitted 7 March, 2022;
originally announced March 2022.
-
Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft
Authors:
Ingmar Kanitscheider,
Joost Huizinga,
David Farhi,
William Hebgen Guss,
Brandon Houghton,
Raul Sampedro,
Peter Zhokhov,
Bowen Baker,
Adrien Ecoffet,
Jie Tang,
Oleg Klimov,
Jeff Clune
Abstract:
An important challenge in reinforcement learning is training agents that can solve a wide variety of tasks. If tasks depend on each other (e.g. needing to learn to walk before learning to run), curriculum learning can speed up learning by focusing on the next best task to learn. We explore curriculum learning in a complex, visual domain with many hard exploration challenges: Minecraft. We find tha…
▽ More
An important challenge in reinforcement learning is training agents that can solve a wide variety of tasks. If tasks depend on each other (e.g. needing to learn to walk before learning to run), curriculum learning can speed up learning by focusing on the next best task to learn. We explore curriculum learning in a complex, visual domain with many hard exploration challenges: Minecraft. We find that learning progress (defined as a change in success probability of a task) is a reliable measure of learnability for automatically constructing an effective curriculum. We introduce a learning-progress based curriculum and test it on a complex reinforcement learning problem (called "Simon Says") where an agent is instructed to obtain a desired goal item. Many of the required skills depend on each other. Experiments demonstrate that: (1) a within-episode exploration bonus for obtaining new items improves performance, (2) dynamically adjusting this bonus across training such that it only applies to items the agent cannot reliably obtain yet further increases performance, (3) the learning-progress based curriculum elegantly follows the learning curve of the agent, and (4) when the learning-progress based curriculum is combined with the dynamic exploration bonus it learns much more efficiently and obtains far higher performance than uniform baselines. These results suggest that combining intra-episode and across-training exploration bonuses with learning progress creates a promising method for automated curriculum generation, which may substantially increase our ability to train more capable, generally intelligent agents.
△ Less
Submitted 28 June, 2021;
originally announced June 2021.
-
Dota 2 with Large Scale Deep Reinforcement Learning
Authors:
OpenAI,
:,
Christopher Berner,
Greg Brockman,
Brooke Chan,
Vicki Cheung,
Przemysław Dębiak,
Christy Dennison,
David Farhi,
Quirin Fischer,
Shariq Hashme,
Chris Hesse,
Rafal Józefowicz,
Scott Gray,
Catherine Olsson,
Jakub Pachocki,
Michael Petrov,
Henrique P. d. O. Pinto,
Jonathan Raiman,
Tim Salimans,
Jeremy Schlatter,
Jonas Schneider,
Szymon Sidor,
Ilya Sutskever,
Jie Tang
, et al. (2 additional authors not shown)
Abstract:
On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learnin…
▽ More
On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
△ Less
Submitted 13 December, 2019;
originally announced December 2019.
-
Precision decay rate calculations in quantum field theory
Authors:
Anders Andreassen,
David Farhi,
William Frost,
Matthew D. Schwartz
Abstract:
Tunneling in quantum field theory is worth understanding properly, not least because it controls the long term fate of our universe. There are however, a number of features of tunneling rate calculations which lack a desirable transparency, such as the necessity of analytic continuation, the appropriateness of using an effective instead of classical potential, and the sensitivity to short-distance…
▽ More
Tunneling in quantum field theory is worth understanding properly, not least because it controls the long term fate of our universe. There are however, a number of features of tunneling rate calculations which lack a desirable transparency, such as the necessity of analytic continuation, the appropriateness of using an effective instead of classical potential, and the sensitivity to short-distance physics. This paper attempts to review in pedagogical detail the physical origin of tunneling and its connection to the path integral. Both the traditional potential-deformation method and a recent more direct propagator-based method are discussed. Some new insights from using approximate semi-classical solutions are presented. In addition, we explore the sensitivity of the lifetime of our universe to short distance physics, such as quantum gravity, emphasizing a number of important subtleties.
△ Less
Submitted 31 August, 2017; v1 submitted 20 April, 2016;
originally announced April 2016.
-
A direct approach to quantum tunneling
Authors:
Anders Andreassen,
David Farhi,
William Frost,
Matthew D. Schwartz
Abstract:
The decay rates of quasistable states in quantum field theories are usually calculated using instanton methods. Standard derivations of these methods rely in a crucial way upon deformations and analytic continuations of the physical potential, and on the saddle point approximation. While the resulting procedure can be checked against other semi-classical approaches in some one-dimensional cases, i…
▽ More
The decay rates of quasistable states in quantum field theories are usually calculated using instanton methods. Standard derivations of these methods rely in a crucial way upon deformations and analytic continuations of the physical potential, and on the saddle point approximation. While the resulting procedure can be checked against other semi-classical approaches in some one-dimensional cases, it is challenging to trace the role of the relevant physical scales, and any intuitive handle on the precision of the approximations involved are at best obscure. In this paper, we use a physical definition of the tunneling probability to derive a formula for the decay rate in both quantum mechanics and quantum field theory directly from the Minkowski path integral, without reference to unphysical deformations of the potential. There are numerous benefits to this approach, from non-perturbative applications to precision calculations and aesthetic simplicity.
△ Less
Submitted 2 February, 2016;
originally announced February 2016.
-
Streamlining resummed QCD calculations using Monte Carlo integration
Authors:
David Farhi,
Ilya Feige,
Marat Freytsis,
Matthew D. Schwartz
Abstract:
Some of the most arduous and error-prone aspects of precision resummed calculations are related to the partonic hard process, having nothing to do with the resummation. In particular, interfacing to parton-distribution functions, combining various channels, and performing the phase space integration can be limiting factors in completing calculations. Conveniently, however, most of these tasks are…
▽ More
Some of the most arduous and error-prone aspects of precision resummed calculations are related to the partonic hard process, having nothing to do with the resummation. In particular, interfacing to parton-distribution functions, combining various channels, and performing the phase space integration can be limiting factors in completing calculations. Conveniently, however, most of these tasks are already automated in many Monte Carlo programs, such as MadGraph, Alpgen or Sherpa. In this paper, we show how such programs can be used to produce distributions of partonic kinematics with associated color structures representing the hard factor in a resummed distribution. These distributions can then be used to weight convolutions of jet, soft and beam functions producing a complete resummed calculation. In fact, only around 1000 unweighted events are necessary to produce precise distributions. A number of examples and checks are provided, including $e^+e^-$ two- and four-jet event shapes, $n$-jettiness and jet-mass related observables at hadron colliders. Attached code can be used to modify MadGraph to export the relevant leading-order hard functions and color structures for arbitrary processes.
△ Less
Submitted 22 July, 2015;
originally announced July 2015.
-
Quantifying the power of multiple event interpretations
Authors:
Yang-Ting Chien,
David Farhi,
David Krohn,
Andrew Marantan,
David Lopez Mateos,
Matthew Schwartz
Abstract:
A number of methods have been proposed recently which exploit multiple highly-correlated interpretations of events, or of jets within an event. For example, Qjets reclusters a jet multiple times and telescoping jets uses multiple cone sizes. Previous work has employed these methods in pseudo-experimental analyses and found that, with a simplified statistical treatment, they give sizable improvemen…
▽ More
A number of methods have been proposed recently which exploit multiple highly-correlated interpretations of events, or of jets within an event. For example, Qjets reclusters a jet multiple times and telescoping jets uses multiple cone sizes. Previous work has employed these methods in pseudo-experimental analyses and found that, with a simplified statistical treatment, they give sizable improvements over traditional methods. In this paper, the improvement gain from multiple event interpretations is explored with methods much closer to those used in real experiments. To this end, we derive a generalized extended maximum likelihood procedure. We study the significance improvement in Higgs to bb with both this method and the simplified method from previous analysis. With either method, we find that using multiple jet radii can provide substantial benefit over a single radius. Another concern we address is that multiple event interpretations might be exploiting similar information to that already present in the standard kinematic variables. By examining correlations between kinematic variables commonly used in LHC analyses and invariant masses obtained with multiple jet reconstructions, we find that using multiple radii is still helpful even on top of standard kinematic variables when combined with boosted decision trees. These results suggest that including multiple event interpretations in a realistic search for Higgs to bb would give additional sensitivity over traditional approaches.
△ Less
Submitted 10 July, 2014;
originally announced July 2014.