Skip to main content

Showing 1–10 of 10 results for author: Lukošiūtė, K

.
  1. arXiv:2502.00072  [pdf, ps, other

    cs.CR cs.AI cs.CL cs.LG

    LLM Cyber Evaluations Don't Capture Real-World Risk

    Authors: Kamilė Lukošiūtė, Adam Swanda

    Abstract: Large language models (LLMs) are demonstrating increasing prowess in cybersecurity applications, creating creating inherent risks alongside their potential for strengthening defenses. In this position paper, we argue that current efforts to evaluate risks posed by these capabilities are misaligned with the goal of understanding real-world impact. Evaluating LLM cybersecurity risk requires more tha… ▽ More

    Submitted 31 January, 2025; originally announced February 2025.

    Comments: 11 pages

  2. arXiv:2308.03296  [pdf, other

    cs.LG cs.CL stat.ML

    Studying Large Language Model Generalization with Influence Functions

    Authors: Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman

    Abstract: When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set?… ▽ More

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: 119 pages, 47 figures, 22 tables

  3. arXiv:2307.13702  [pdf, other

    cs.AI cs.CL cs.LG

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Authors: Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume , et al. (5 additional authors not shown)

    Abstract: Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change… ▽ More

    Submitted 16 July, 2023; originally announced July 2023.

  4. arXiv:2307.11768  [pdf, other

    cs.CL cs.AI cs.LG

    Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

    Authors: Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Sam McCandlish, Sheer El Showk, Tamera Lanham, Tim Maxwell, Venkatesa Chandrasekaran, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez

    Abstract: As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perfo… ▽ More

    Submitted 25 July, 2023; v1 submitted 16 July, 2023; originally announced July 2023.

    Comments: For few-shot examples and prompts, see https://github.com/anthropics/DecompositionFaithfulnessPaper

  5. arXiv:2302.07459  [pdf, other

    cs.CL

    The Capacity for Moral Self-Correction in Large Language Models

    Authors: Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi Mercado, Nova DasSarma , et al. (24 additional authors not shown)

    Abstract: We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability… ▽ More

    Submitted 18 February, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

  6. arXiv:2212.09251  [pdf, other

    cs.CL cs.AI cs.LG

    Discovering Language Model Behaviors with Model-Written Evaluations

    Authors: Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion , et al. (38 additional authors not shown)

    Abstract: As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from inst… ▽ More

    Submitted 19 December, 2022; originally announced December 2022.

    Comments: for associated data visualizations, see https://www.evals.anthropic.com/model-written/ for full datasets, see https://github.com/anthropics/evals

  7. arXiv:2211.03540  [pdf, other

    cs.HC cs.AI cs.CL

    Measuring Progress on Scalable Oversight for Large Language Models

    Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse , et al. (21 additional authors not shown)

    Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think abou… ▽ More

    Submitted 11 November, 2022; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: v2 fixes a few typos from v1

  8. arXiv:2206.09696  [pdf, other

    astro-ph.HE astro-ph.CO astro-ph.IM gr-qc

    Prospects of Gravitational Wave Follow-up Through a Wide-field Ultra-violet Satellite: a Dorado Case Study

    Authors: Bas Dorsman, Geert Raaijmakers, S. Bradley Cenko, Samaya Nissanke, Leo P. Singer, Mansi M. Kasliwal, Anthony L. Piro, Eric C. Bellm, Dieter H. Hartmann, Kenta Hotokezaka, Kamilė Lukošiūtė

    Abstract: The detection of gravitational waves from binary neuron star merger GW170817 and electromagnetic counterparts GRB170817 and AT2017gfo kick-started the field of gravitational wave multimessenger astronomy. The optically red to near infra-red emission (`red' component) of AT2017gfo was readily explained as produced by the decay of newly created nuclei produced by rapid neutron capture (a kilonova).… ▽ More

    Submitted 20 June, 2022; originally announced June 2022.

    Comments: 13 pages, 2 tables, 7 figures. Comments welcome. To be submitted to ApJ

  9. arXiv:2204.00285  [pdf, other

    astro-ph.IM astro-ph.HE

    KilonovaNet: Surrogate Models of Kilonova Spectra with Conditional Variational Autoencoders

    Authors: Kamilė Lukošiūtė, Geert Raaijmakers, Zoheyr Doctor, Marcelle Soares-Santos, Brian Nord

    Abstract: Detailed radiative transfer simulations of kilonova spectra play an essential role in multimessenger astrophysics. Using the simulation results in parameter inference studies requires building a surrogate model from the simulation outputs to use in algorithms requiring sampling. In this work, we present KilonovaNet, an implementation of conditional variational autoencoders (cVAEs) for the construc… ▽ More

    Submitted 1 April, 2022; originally announced April 2022.

  10. arXiv:2102.11569  [pdf, other

    astro-ph.HE astro-ph.CO gr-qc

    The Challenges Ahead for Multimessenger Analyses of Gravitational Waves and Kilonova: a Case Study on GW190425

    Authors: Geert Raaijmakers, Samaya Nissanke, Francois Foucart, Mansi M. Kasliwal, Mattia Bulla, Rodrigo Fernandez, Amelia Henkel, Tanja Hinderer, Kenta Hotokezaka, Kamilė Lukošiūtė, Tejaswi Venumadhav, Sarah Antier, Michael W. Coughlin, Tim Dietrich, Thomas D. P. Edwards

    Abstract: In recent years, there have been significant advances in multi-messenger astronomy due to the discovery of the first, and so far only confirmed, gravitational wave event with a simultaneous electromagnetic (EM) counterpart, as well as improvements in numerical simulations, gravitational wave (GW) detectors, and transient astronomy. This has led to the exciting possibility of performing joint analy… ▽ More

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: 22 pages, 12 figures, comments are welcome