-
LLM Cyber Evaluations Don't Capture Real-World Risk
Authors:
Kamilė Lukošiūtė,
Adam Swanda
Abstract:
Large language models (LLMs) are demonstrating increasing prowess in cybersecurity applications, creating inherent risks alongside their potential for strengthening defenses. In this position paper, we argue that current efforts to evaluate risks posed by these capabilities are misaligned with the goal of understanding real-world impact. Evaluating LLM cybersecurity risk requires more than just measuring model capabilities -- it demands a comprehensive risk assessment that incorporates analysis of threat actor adoption behavior and potential for impact. We propose a risk assessment framework for LLM cyber capabilities and apply it to a case study of language models used as cybersecurity assistants. Our evaluation of frontier models reveals high compliance rates but moderate accuracy on realistic cyber assistance tasks. However, our framework suggests that this particular use case presents only moderate risk due to limited operational advantages and impact potential. Based on these findings, we recommend several improvements to align research priorities with real-world impact assessment, including closer academia-industry collaboration, more realistic modeling of attacker behavior, and inclusion of economic metrics in evaluations. This work represents an important step toward more effective assessment and mitigation of LLM-enabled cybersecurity risks.
Submitted 31 January, 2025;
originally announced February 2025.
-
Studying Large Language Model Generalization with Influence Functions
Authors:
Roger Grosse,
Juhan Bae,
Cem Anil,
Nelson Elhage,
Alex Tamkin,
Amirhossein Tajdini,
Benoit Steiner,
Dustin Li,
Esin Durmus,
Ethan Perez,
Evan Hubinger,
Kamilė Lukošiūtė,
Karina Nguyen,
Nicholas Joseph,
Sam McCandlish,
Jared Kaplan,
Samuel R. Bowman
Abstract:
When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.
Submitted 7 August, 2023;
originally announced August 2023.
-
Measuring Faithfulness in Chain-of-Thought Reasoning
Authors:
Tamera Lanham,
Anna Chen,
Ansh Radhakrishnan,
Benoit Steiner,
Carson Denison,
Danny Hernandez,
Dustin Li,
Esin Durmus,
Evan Hubinger,
Jackson Kernion,
Kamilė Lukošiūtė,
Karina Nguyen,
Newton Cheng,
Nicholas Joseph,
Nicholas Schiefer,
Oliver Rausch,
Robin Larson,
Sam McCandlish,
Sandipan Kundu,
Saurav Kadavath,
Shannon Yang,
Thomas Henighan,
Timothy Maxwell,
Timothy Telleen-Lawton,
Tristan Hume
, et al. (5 additional authors not shown)
Abstract:
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
Submitted 16 July, 2023;
originally announced July 2023.
-
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Authors:
Ansh Radhakrishnan,
Karina Nguyen,
Anna Chen,
Carol Chen,
Carson Denison,
Danny Hernandez,
Esin Durmus,
Evan Hubinger,
Jackson Kernion,
Kamilė Lukošiūtė,
Newton Cheng,
Nicholas Joseph,
Nicholas Schiefer,
Oliver Rausch,
Sam McCandlish,
Sheer El Showk,
Tamera Lanham,
Tim Maxwell,
Venkatesa Chandrasekaran,
Zac Hatfield-Dodds,
Jared Kaplan,
Jan Brauner,
Samuel R. Bowman,
Ethan Perez
Abstract:
As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
Submitted 25 July, 2023; v1 submitted 16 July, 2023;
originally announced July 2023.
-
The Capacity for Moral Self-Correction in Large Language Models
Authors:
Deep Ganguli,
Amanda Askell,
Nicholas Schiefer,
Thomas I. Liao,
Kamilė Lukošiūtė,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Catherine Olsson,
Danny Hernandez,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Ethan Perez,
Jackson Kernion,
Jamie Kerr,
Jared Mueller,
Joshua Landau,
Kamal Ndousse,
Karina Nguyen,
Liane Lovitt,
Michael Sellitto,
Nelson Elhage,
Noemi Mercado,
Nova DasSarma
, et al. (24 additional authors not shown)
Abstract:
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
Submitted 18 February, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Discovering Language Model Behaviors with Model-Written Evaluations
Authors:
Ethan Perez,
Sam Ringer,
Kamilė Lukošiūtė,
Karina Nguyen,
Edwin Chen,
Scott Heiner,
Craig Pettit,
Catherine Olsson,
Sandipan Kundu,
Saurav Kadavath,
Andy Jones,
Anna Chen,
Ben Mann,
Brian Israel,
Bryan Seethor,
Cameron McKinnon,
Christopher Olah,
Da Yan,
Daniela Amodei,
Dario Amodei,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Guro Khundadze,
Jackson Kernion
, et al. (38 additional authors not shown)
Abstract:
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shutdown. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
Submitted 19 December, 2022;
originally announced December 2022.
-
Measuring Progress on Scalable Oversight for Large Language Models
Authors:
Samuel R. Bowman,
Jeeyoon Hyun,
Ethan Perez,
Edwin Chen,
Craig Pettit,
Scott Heiner,
Kamilė Lukošiūtė,
Amanda Askell,
Andy Jones,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Cameron McKinnon,
Christopher Olah,
Daniela Amodei,
Dario Amodei,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Jackson Kernion,
Jamie Kerr,
Jared Mueller,
Jeffrey Ladish,
Joshua Landau,
Kamal Ndousse
, et al. (21 additional authors not shown)
Abstract:
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
Submitted 11 November, 2022; v1 submitted 4 November, 2022;
originally announced November 2022.
-
Prospects of Gravitational Wave Follow-up Through a Wide-field Ultra-violet Satellite: a Dorado Case Study
Authors:
Bas Dorsman,
Geert Raaijmakers,
S. Bradley Cenko,
Samaya Nissanke,
Leo P. Singer,
Mansi M. Kasliwal,
Anthony L. Piro,
Eric C. Bellm,
Dieter H. Hartmann,
Kenta Hotokezaka,
Kamilė Lukošiūtė
Abstract:
The detection of gravitational waves from binary neutron star merger GW170817 and electromagnetic counterparts GRB170817 and AT2017gfo kick-started the field of gravitational wave multimessenger astronomy. The optically red to near infra-red emission ('red' component) of AT2017gfo was readily explained as produced by the decay of newly created nuclei produced by rapid neutron capture (a kilonova). However, the ultra-violet to optically blue emission ('blue' component) that was dominant at early times (up to 1.5 days) received no consensus regarding its driving physics. Among many explanations, two leading contenders are kilonova radiation from a lanthanide-poor ejecta component or shock interaction (cocoon emission). In this work, we simulate AT2017gfo-like light curves and perform a Bayesian analysis to study whether an ultra-violet satellite capable of rapid gravitational wave follow-up could distinguish between physical processes driving the early 'blue' component. We find that a Dorado-like ultra-violet satellite, with a 50 sq. deg. field of view and a limiting magnitude (AB) of 20.5 for a 10 minute exposure, is able to distinguish radiation components up to at least 160 Mpc if data collection starts within 3.2 or 5.2 hours for two possible AT2017gfo-like light curve scenarios. We also study the degree to which parameters can be constrained with the obtained photometry. We find that, while ultra-violet data alone constrains parameters governing the outer ejecta properties, the combination of both ground-based optical and space-based ultra-violet data allows for tight constraints for all but one parameter of the kilonova model up to 160 Mpc. These results imply that an ultra-violet mission like Dorado would provide unique insights into the early evolution of the post-merger system and its driving emission physics.
Submitted 20 June, 2022;
originally announced June 2022.
-
KilonovaNet: Surrogate Models of Kilonova Spectra with Conditional Variational Autoencoders
Authors:
Kamilė Lukošiūtė,
Geert Raaijmakers,
Zoheyr Doctor,
Marcelle Soares-Santos,
Brian Nord
Abstract:
Detailed radiative transfer simulations of kilonova spectra play an essential role in multimessenger astrophysics. Using the simulation results in parameter inference studies requires building a surrogate model from the simulation outputs to use in algorithms requiring sampling. In this work, we present KilonovaNet, an implementation of conditional variational autoencoders (cVAEs) for the construction of surrogate models of kilonova spectra. This method can be trained on spectra directly, removing overhead time of pre-processing spectra, and greatly speeds up parameter inference time. We build surrogate models of three state-of-the-art kilonova simulation data sets and present in-depth surrogate error evaluation methods, which can in general be applied to any surrogate construction method. By creating synthetic photometric observations from the spectral surrogate, we perform parameter inference for the observed light curve data of GW170817 and compare the results with previous analyses. Given the speed with which KilonovaNet performs during parameter inference, it will serve as a useful tool in future gravitational wave observing runs to quickly analyze potential kilonova candidates.
Submitted 1 April, 2022;
originally announced April 2022.
-
The Challenges Ahead for Multimessenger Analyses of Gravitational Waves and Kilonova: a Case Study on GW190425
Authors:
Geert Raaijmakers,
Samaya Nissanke,
Francois Foucart,
Mansi M. Kasliwal,
Mattia Bulla,
Rodrigo Fernandez,
Amelia Henkel,
Tanja Hinderer,
Kenta Hotokezaka,
Kamilė Lukošiūtė,
Tejaswi Venumadhav,
Sarah Antier,
Michael W. Coughlin,
Tim Dietrich,
Thomas D. P. Edwards
Abstract:
In recent years, there have been significant advances in multi-messenger astronomy due to the discovery of the first, and so far only confirmed, gravitational wave event with a simultaneous electromagnetic (EM) counterpart, as well as improvements in numerical simulations, gravitational wave (GW) detectors, and transient astronomy. This has led to the exciting possibility of performing joint analyses of the GW and EM data, providing additional constraints on fundamental properties of the binary progenitor and merger remnant. Here, we present a new Bayesian framework that allows inference of these properties, while taking into account the systematic modeling uncertainties that arise when mapping from GW binary progenitor properties to photometric light curves. We extend the relative binning method presented in Zackay et al. (2018) to include extrinsic GW parameters for fast analysis of the GW signal. The focus of our EM framework is on light curves arising from r-process nucleosynthesis in the ejected material during and after merger, the so-called kilonova, and particularly on black hole-neutron star systems. As a case study, we examine the recent detection of GW190425, where the primary object is consistent with being either a black hole (BH) or a neutron star (NS). We show quantitatively how improved mapping between binary progenitor and outflow properties, and/or an increase in EM data quantity and quality, are required in order to break degeneracies in the fundamental source parameters.
Submitted 23 February, 2021;
originally announced February 2021.