Skip to main content

Showing 1–19 of 19 results for author: Mindermann, S

.
  1. arXiv:2504.15416  [pdf, other

    cs.CY

    Bare Minimum Mitigations for Autonomous AI Development

    Authors: Joshua Clymer, Isabella Duan, Chris Cundy, Yawen Duan, Fynn Heide, Chaochao Lu, Sören Mindermann, Conor McGurk, Xudong Pan, Saad Siddiqui, Jingren Wang, Min Yang, Xianyuan Zhan

    Abstract: Artificial intelligence (AI) is advancing rapidly, with the potential for significantly automating AI research and development itself in the near future. In 2024, international scientists, including Turing Award recipients, warned of risks from autonomous AI research and development (R&D), suggesting a red line such that no AI system should be able to improve itself or other AI systems without exp… ▽ More

    Submitted 23 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: 12 pages, 2 figures

  2. arXiv:2504.12914  [pdf, other

    cs.CY

    In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?

    Authors: Ben Bucknall, Saad Siddiqui, Lara Thurnherr, Conor McGurk, Ben Harack, Anka Reuel, Patricia Paskov, Casey Mahoney, Sören Mindermann, Scott Singer, Vinay Hiremath, Charbel-Raphaël Segerie, Oscar Delaney, Alessandro Abate, Fazl Barez, Michael K. Cohen, Philip Torr, Ferenc Huszár, Anisoara Calinescu, Gabriel Davis Jones, Yoshua Bengio, Robert Trager

    Abstract: International cooperation is common in AI research, including between geopolitical rivals. While many experts advocate for greater international cooperation on AI safety to address shared global risks, some view cooperation on AI with suspicion, arguing that it can pose unacceptable risks to national security. However, the extent to which cooperation on AI safety poses such risks, as well as provi… ▽ More

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Accepted to ACM Conference on Fairness, Accountability, and Transparency (FAccT 2025)

  3. arXiv:2502.15657  [pdf, other

    cs.AI cs.LG

    Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

    Authors: Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King

    Abstract: The leading AI companies are increasingly focused on building generalist AI agents -- systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human contr… ▽ More

    Submitted 24 February, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

    Comments: v2 with fixed formatting for URLs and hyperlinks

  4. arXiv:2501.17805  [pdf

    cs.CY cs.AI cs.LG

    International AI Safety Report

    Authors: Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, Hoda Heidari, Anson Ho, Sayash Kapoor, Leila Khalatbari, Shayne Longpre, Sam Manning, Vasilios Mavroudis, Mantas Mazeika, Julian Michael, Jessica Newman, Kwan Yee Ng, Chinasa T. Okolo, Deborah Raji, Girish Sastry, Elizabeth Seger , et al. (71 additional authors not shown)

    Abstract: The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report's Expert Advisory Panel. A total of 100 AI experts contributed, repr… ▽ More

    Submitted 29 January, 2025; originally announced January 2025.

  5. arXiv:2501.04952  [pdf, other

    cs.LG cs.AI cs.CY

    Open Problems in Machine Unlearning for AI Safety

    Authors: Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal

    Abstract: As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks, which has been the pr… ▽ More

    Submitted 8 January, 2025; originally announced January 2025.

  6. arXiv:2412.14093  [pdf, other

    cs.AI cs.CL cs.LG

    Alignment faking in large language models

    Authors: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger

    Abstract: We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model… ▽ More

    Submitted 19 December, 2024; v1 submitted 18 December, 2024; originally announced December 2024.

  7. arXiv:2412.05282  [pdf

    cs.CY cs.AI

    International Scientific Report on the Safety of Advanced AI (Interim Report)

    Authors: Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Danielle Goldfarb, Hoda Heidari, Leila Khalatbari, Shayne Longpre, Vasilios Mavroudis, Mantas Mazeika, Kwan Yee Ng, Chinasa T. Okolo, Deborah Raji, Theodora Skeadas, Florian Tramèr, Bayo Adekanmbi, Paul Christiano, David Dalrymple, Thomas G. Dietterich, Edward Felten, Pascale Fung, Pierre-Olivier Gourinchas , et al. (19 additional authors not shown)

    Abstract: This is the interim publication of the first International Scientific Report on the Safety of Advanced AI. The report synthesises the scientific understanding of general-purpose AI -- AI that can perform a wide variety of tasks -- with a focus on understanding and managing its risks. A diverse group of 75 AI experts contributed to this report, including an international Expert Advisory Panel nomin… ▽ More

    Submitted 9 April, 2025; v1 submitted 5 November, 2024; originally announced December 2024.

    Comments: Available under the open government license at https://www.gov.uk/government/publications/international-scientific-report-on-the-safety-of-advanced-ai

  8. arXiv:2401.05566  [pdf, other

    cs.CR cs.AI cs.CL cs.LG cs.SE

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec , et al. (14 additional authors not shown)

    Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa… ▽ More

    Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: updated to add missing acknowledgements

  9. arXiv:2310.17688  [pdf, other

    cs.CY cs.AI cs.CL cs.LG

    Managing extreme AI risks amid rapid progress

    Authors: Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

    Abstract: Artificial Intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI's impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although rese… ▽ More

    Submitted 22 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: Published in Science: https://www.science.org/doi/10.1126/science.adn0117

  10. arXiv:2310.13798  [pdf, other

    cs.CL cs.AI

    Specific versus General Principles for Constitutional AI

    Authors: Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus, Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen, Nelson Elhage, Newton Cheng, Nicholas Schiefer, Nova DasSarma, Oliver Rausch, Robin Larson , et al. (11 additional authors not shown)

    Abstract: Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expressi… ▽ More

    Submitted 20 October, 2023; originally announced October 2023.

  11. arXiv:2309.15840  [pdf, other

    cs.CL cs.AI cs.LG

    How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

    Authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

    Abstract: Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a… ▽ More

    Submitted 26 September, 2023; originally announced September 2023.

  12. arXiv:2209.00626  [pdf, ps, other

    cs.AI cs.LG

    The Alignment Problem from a Deep Learning Perspective

    Authors: Richard Ngo, Lawrence Chan, Sören Mindermann

    Abstract: In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities across many critical domains. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligne… ▽ More

    Submitted 4 May, 2025; v1 submitted 29 August, 2022; originally announced September 2022.

    Comments: Published in ICLR 2024

  13. arXiv:2206.07137  [pdf, other

    cs.LG cs.AI cs.CL cs.CV

    Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

    Authors: Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, Yarin Gal

    Abstract: Training on web-scale data can take months. But most computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mi… ▽ More

    Submitted 26 September, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

    Comments: ICML 2022

  14. arXiv:2107.02565  [pdf, other

    cs.LG cs.IT

    Prioritized training on points that are learnable, worth learning, and not yet learned (workshop version)

    Authors: Sören Mindermann, Muhammed Razzak, Winnie Xu, Andreas Kirsch, Mrinank Sharma, Adrien Morisot, Aidan N. Gomez, Sebastian Farquhar, Jan Brauner, Yarin Gal

    Abstract: We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are "just right". We propose an information-theoretic acquisition function -- the reducible validation loss -- and compute it with a small proxy model -- GoldiProx -- to efficiently choose training points that maximize information about a validation set. We show that the "hard"… ▽ More

    Submitted 17 October, 2023; v1 submitted 6 July, 2021; originally announced July 2021.

    Journal ref: ICML 2021 Workshop on Subset Selection in Machine Learning

  15. arXiv:2103.04850  [pdf, other

    cs.LG stat.ML

    Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

    Authors: Andrew Jesson, Sören Mindermann, Yarin Gal, Uri Shalit

    Abstract: We study the problem of learning conditional average treatment effects (CATE) from high-dimensional, observational data with unobserved confounders. Unobserved confounders introduce ignorance -- a level of unidentifiability -- about an individual's response to treatment by inducing bias in CATE estimates. We present a new parametric interval estimator suited for high-dimensional data, that estimat… ▽ More

    Submitted 1 February, 2022; v1 submitted 8 March, 2021; originally announced March 2021.

    Comments: 19 pages, 5 figures, ICML 2021

    Journal ref: PMLR 139 (2021) 4829-4838

  16. arXiv:2007.13454  [pdf, other

    stat.AP cs.LG q-bio.PE q-bio.QM stat.ML

    How Robust are the Estimated Effects of Nonpharmaceutical Interventions against COVID-19?

    Authors: Mrinank Sharma, Sören Mindermann, Jan Markus Brauner, Gavin Leech, Anna B. Stephenson, Tomáš Gavenčiak, Jan Kulveit, Yee Whye Teh, Leonid Chindelevitch, Yarin Gal

    Abstract: To what extent are effectiveness estimates of nonpharmaceutical interventions (NPIs) against COVID-19 influenced by the assumptions our models make? To answer this question, we investigate 2 state-of-the-art NPI effectiveness models and propose 6 variants that make different structural assumptions. In particular, we investigate how well NPI effectiveness estimates generalise to unseen countries, a… ▽ More

    Submitted 20 December, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Journal ref: NeurIPS 2020, Advances in Neural Information Processing Systems 33

  17. arXiv:2007.00163  [pdf, other

    cs.LG stat.ML

    Identifying Causal-Effect Inference Failure with Uncertainty-Aware Models

    Authors: Andrew Jesson, Sören Mindermann, Uri Shalit, Yarin Gal

    Abstract: Recommending the best course of action for an individual is a major application of individual-level causal effect estimation. This application is often needed in safety-critical domains such as healthcare, where estimating and communicating uncertainty to decision-makers is crucial. We introduce a practical approach for integrating uncertainty estimation into a class of state-of-the-art neural net… ▽ More

    Submitted 22 October, 2020; v1 submitted 30 June, 2020; originally announced July 2020.

  18. arXiv:1809.03060  [pdf, other

    cs.LG cs.AI stat.ML

    Active Inverse Reward Design

    Authors: Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

    Abstract: Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the desired behavior, but this only guarantees good behavior in the training environment. We propose structuring this process as a series of queries asking the user to compare between different reward functions. Thus we can actively select queries for maximum informativeness about the true rewar… ▽ More

    Submitted 6 November, 2019; v1 submitted 9 September, 2018; originally announced September 2018.

  19. arXiv:1712.05812  [pdf, ps, other

    cs.AI

    Occam's razor is insufficient to infer the preferences of irrational agents

    Authors: Stuart Armstrong, Sören Mindermann

    Abstract: Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention. Unlike the well-known… ▽ More

    Submitted 11 January, 2019; v1 submitted 15 December, 2017; originally announced December 2017.