-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Authors:
Carson Denison,
Monte MacDiarmid,
Fazl Barez,
David Duvenaud,
Shauna Kravec,
Samuel Marks,
Nicholas Schiefer,
Ryan Soklaski,
Alex Tamkin,
Jared Kaplan,
Buck Shlegeris,
Samuel R. Bowman,
Ethan Perez,
Evan Hubinger
Abstract:
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be to…
▽ More
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.
△ Less
Submitted 28 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Fourier-Based Augmentations for Improved Robustness and Uncertainty Calibration
Authors:
Ryan Soklaski,
Michael Yee,
Theodoros Tsiligkaridis
Abstract:
Diverse data augmentation strategies are a natural approach to improving robustness in computer vision models against unforeseen shifts in data distribution. However, the ability to tailor such strategies to inoculate a model against specific classes of corruptions or attacks -- without incurring substantial losses in robustness against other classes of corruptions -- remains elusive. In this work…
▽ More
Diverse data augmentation strategies are a natural approach to improving robustness in computer vision models against unforeseen shifts in data distribution. However, the ability to tailor such strategies to inoculate a model against specific classes of corruptions or attacks -- without incurring substantial losses in robustness against other classes of corruptions -- remains elusive. In this work, we successfully harden a model against Fourier-based attacks, while producing superior-to-AugMix accuracy and calibration results on both the CIFAR-10-C and CIFAR-100-C datasets; classification error is reduced by over ten percentage points for some high-severity noise and digital-type corruptions. We achieve this by incorporating Fourier-basis perturbations in the AugMix image-augmentation framework. Thus we demonstrate that the AugMix framework can be tailored to effectively target particular distribution shifts, while boosting overall model robustness.
△ Less
Submitted 24 February, 2022;
originally announced February 2022.
-
Knowledge-Integrated Informed AI for National Security
Authors:
Anu K. Myne,
Kevin J. Leahy,
Ryan J. Soklaski
Abstract:
The state of artificial intelligence technology has a rich history that dates back decades and includes two fall-outs before the explosive resurgence of today, which is credited largely to data-driven techniques. While AI technology has and continues to become increasingly mainstream with impact across domains and industries, it's not without several drawbacks, weaknesses, and potential to cause u…
▽ More
The state of artificial intelligence technology has a rich history that dates back decades and includes two fall-outs before the explosive resurgence of today, which is credited largely to data-driven techniques. While AI technology has and continues to become increasingly mainstream with impact across domains and industries, it's not without several drawbacks, weaknesses, and potential to cause undesired effects. AI techniques are numerous with many approaches and variants, but they can be classified simply based on the degree of knowledge they capture and how much data they require; two broad categories emerge as prominent across AI to date: (1) techniques that are primarily, and often solely, data-driven while leveraging little to no knowledge and (2) techniques that primarily leverage knowledge and depend less on data. Now, a third category is starting to emerge that leverages both data and knowledge, that some refer to as "informed AI." This third category can be a game changer within the national security domain where there is ample scientific and domain-specific knowledge that stands ready to be leveraged, and where purely data-driven AI can lead to serious unwanted consequences.
This report shares findings from a thorough exploration of AI approaches that exploit data as well as principled and/or practical knowledge, which we refer to as "knowledge-integrated informed AI." Specifically, we review illuminating examples of knowledge integrated in deep learning and reinforcement learning pipelines, taking note of the performance gains they provide. We also discuss an apparent trade space across variants of knowledge-integrated informed AI, along with observed and prominent issues that suggest worthwhile future research directions. Most importantly, this report suggests how the advantages of knowledge-integrated informed AI stand to benefit the national security domain.
△ Less
Submitted 4 February, 2022;
originally announced February 2022.
-
Tools and Practices for Responsible AI Engineering
Authors:
Ryan Soklaski,
Justin Goodwin,
Olivia Brown,
Michael Yee,
Jason Matterer
Abstract:
Responsible Artificial Intelligence (AI) - the practice of developing, evaluating, and maintaining accurate AI systems that also exhibit essential properties such as robustness and explainability - represents a multifaceted challenge that often stretches standard machine learning tooling, frameworks, and testing methods beyond their limits. In this paper, we present two new software libraries - hy…
▽ More
Responsible Artificial Intelligence (AI) - the practice of developing, evaluating, and maintaining accurate AI systems that also exhibit essential properties such as robustness and explainability - represents a multifaceted challenge that often stretches standard machine learning tooling, frameworks, and testing methods beyond their limits. In this paper, we present two new software libraries - hydra-zen and the rAI-toolbox - that address critical needs for responsible AI engineering. hydra-zen dramatically simplifies the process of making complex AI applications configurable, and their behaviors reproducible. The rAI-toolbox is designed to enable methods for evaluating and enhancing the robustness of AI-models in a way that is scalable and that composes naturally with other popular ML frameworks. We describe the design principles and methodologies that make these tools effective, including the use of property-based testing to bolster the reliability of the tools themselves. Finally, we demonstrate the composability and flexibility of the tools by showing how various use cases from adversarial robustness and explainable AI can be concisely implemented with familiar APIs.
△ Less
Submitted 14 January, 2022;
originally announced January 2022.
-
Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning
Authors:
David Mascharka,
Philip Tran,
Ryan Soklaski,
Arjun Majumdar
Abstract:
Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoni…
▽ More
Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives' outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art.
△ Less
Submitted 2 July, 2018; v1 submitted 14 March, 2018;
originally announced March 2018.