Showing 1–5 of 5 results for author: Soklaski, R

Searching in archive cs.
  1. arXiv:2406.10162  [pdf, other]

    cs.AI cs.CL

    Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

    Authors: Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

    Abstract: In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be to…

    Submitted 28 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: Make it easier to find samples from the model, and highlight that our operational definition of reward tampering has false positives where the model attempts to complete the task honestly but edits the reward. Add paragraph to conclusion to this effect, and add sentence to figure 1 to this effect

  2. arXiv:2202.12412  [pdf, other]

    cs.CV cs.LG

    Fourier-Based Augmentations for Improved Robustness and Uncertainty Calibration

    Authors: Ryan Soklaski, Michael Yee, Theodoros Tsiligkaridis

    Abstract: Diverse data augmentation strategies are a natural approach to improving robustness in computer vision models against unforeseen shifts in data distribution. However, the ability to tailor such strategies to inoculate a model against specific classes of corruptions or attacks -- without incurring substantial losses in robustness against other classes of corruptions -- remains elusive. In this work…

    Submitted 24 February, 2022; originally announced February 2022.

    Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia

  3. arXiv:2202.03188  [pdf]

    cs.AI

    Knowledge-Integrated Informed AI for National Security

    Authors: Anu K. Myne, Kevin J. Leahy, Ryan J. Soklaski

    Abstract: The state of artificial intelligence technology has a rich history that dates back decades and includes two fall-outs before the explosive resurgence of today, which is credited largely to data-driven techniques. While AI technology has and continues to become increasingly mainstream with impact across domains and industries, it's not without several drawbacks, weaknesses, and potential to cause u…

    Submitted 4 February, 2022; originally announced February 2022.

    Report number: Technical Report TR-1272

  4. arXiv:2201.05647  [pdf, other]

    cs.LG cs.AI cs.SE

    Tools and Practices for Responsible AI Engineering

    Authors: Ryan Soklaski, Justin Goodwin, Olivia Brown, Michael Yee, Jason Matterer

    Abstract: Responsible Artificial Intelligence (AI) - the practice of developing, evaluating, and maintaining accurate AI systems that also exhibit essential properties such as robustness and explainability - represents a multifaceted challenge that often stretches standard machine learning tooling, frameworks, and testing methods beyond their limits. In this paper, we present two new software libraries - hy…

    Submitted 14 January, 2022; originally announced January 2022.

  5. Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

    Authors: David Mascharka, Philip Tran, Ryan Soklaski, Arjun Majumdar

    Abstract: Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoni…

    Submitted 2 July, 2018; v1 submitted 14 March, 2018; originally announced March 2018.

    Comments: CVPR 2018 pre-print