
Showing 1–50 of 65 results for author: Bau, D

  1. arXiv:2505.17441  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Discovering Forbidden Topics in Language Models

    Authors: Can Rager, Chris Wendler, Rohit Gandikota, David Bau

    Abstract: Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, Iterated Prefill Crawler (IPC), that uses token prefilling to find forbidden topics. We benchmark IPC on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 3…

    Submitted 11 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  2. arXiv:2505.17013  [pdf, ps, other]

    cs.LG cs.CV

    When Are Concepts Erased From Diffusion Models?

    Authors: Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen

    Abstract: Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generatin…

    Submitted 30 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: Project Page: https://nyu-dice-lab.github.io/when-are-concepts-erased/

  3. arXiv:2505.14685  [pdf, ps, other]

    cs.CL

    Language Models use Lookbacks to Track Beliefs

    Authors: Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger

    Abstract: How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze Llama-3-70B-Instruct's ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset that consists of simple stories where two charact…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: 32 pages, 32 figures. Code and data at https://belief.baulab.info/

  4. arXiv:2505.08135  [pdf, ps, other]

    cs.SE cs.AI cs.DC cs.PF

    Leveraging AI for Productive and Trustworthy HPC Software: Challenges and Research Directions

    Authors: Keita Teranishi, Harshitha Menon, William F. Godoy, Prasanna Balaprakash, David Bau, Tal Ben-Nun, Abhinav Bhatele, Franz Franchetti, Michael Franusich, Todd Gamblin, Giorgis Georgakoudis, Tom Goldstein, Arjun Guha, Steven Hahn, Costin Iancu, Zheming Jin, Terry Jones, Tze Meng Low, Het Mankad, Narasinga Rao Miniskar, Mohammad Alaul Haque Monil, Daniel Nichols, Konstantinos Parasyris, Swaroop Pophale, Pedro Valero-Lara, et al. (3 additional authors not shown)

    Abstract: We discuss the challenges and propose research directions for using AI to revolutionize the development of high-performance computing (HPC) software. AI technologies, in particular large language models, have transformed every aspect of software development. For its part, HPC software is recognized as a highly specialized scientific field of its own. We discuss the challenges associated with lever…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 12 pages, 1 Figure, Accepted at "The 1st International Workshop on Foundational Large Language Models Advances for HPC" LLM4HPC to be held in conjunction with ISC High Performance 2025

  5. arXiv:2504.13151  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    MIB: A Mechanistic Interpretability Benchmark

    Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

    Abstract: How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization…

    Submitted 9 June, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Accepted to ICML 2025. Project website at https://mib-bench.github.io

  6. arXiv:2504.03022  [pdf, other]

    cs.CL cs.AI

    The Dual-Route Model of Induction

    Authors: Sheridan Feucht, Eric Todd, Byron Wallace, David Bau

    Abstract: Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we introduce a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in paralle…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: 36 pages, 39 figures. Code and data at https://dualroute.baulab.info

    ACM Class: I.2.7

  7. arXiv:2503.10637  [pdf, other]

    cs.GR cs.CV

    Distilling Diversity and Control in Diffusion Models

    Authors: Rohit Gandikota, David Bau

    Abstract: Distilled diffusion models suffer from a critical limitation: reduced sample diversity compared to their base counterparts. In this work, we uncover that despite this diversity loss, distilled models retain the fundamental concept representations of base models. We demonstrate control distillation - where control mechanisms like Concept Sliders and LoRAs trained on base models can be seamlessly tr…

    Submitted 14 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: Project Page: https://distillation.baulab.info

  8. arXiv:2502.13319  [pdf, other]

    cs.CL

    Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare

    Authors: Hiba Ahsan, Arnab Sen Sharma, Silvio Amir, David Bau, Byron C. Wallace

    Abstract: We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gende…

    Submitted 18 February, 2025; originally announced February 2025.

  9. arXiv:2502.04577  [pdf, other]

    cs.LG cs.CL

    Position-aware Automatic Circuit Discovery

    Authors: Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov

    Abstract: A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model's computation graph that executes a specific task. We identify a gap in existing circuit discovery methods: they assume circuits are position-invariant, treating model components as equally relevant across input positions. This limits their ability to capture…

    Submitted 6 February, 2025; originally announced February 2025.

    MSC Class: 68T50 ACM Class: I.2.7

  10. arXiv:2502.01639  [pdf, other]

    cs.CV cs.GR cs.LG

    SliderSpace: Decomposing the Visual Capabilities of Diffusion Models

    Authors: Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin

    Abstract: We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each directio…

    Submitted 3 February, 2025; originally announced February 2025.

    Comments: Project Website: https://sliderspace.baulab.info

  11. arXiv:2501.16496  [pdf, other]

    cs.LG

    Open Problems in Mechanistic Interpretability

    Authors: Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, et al. (4 additional authors not shown)

    Abstract: Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals,…

    Submitted 27 January, 2025; originally announced January 2025.

  12. arXiv:2412.06966  [pdf, other]

    cs.LG cs.AI cs.CY

    Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

    Authors: A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Matthew Jagielski, Katja Filippova, Ken Ziyu Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Niloofar Mireshghallah, Ilia Shumailov, Eleni Triantafillou, Peter Kairouz, Nicole Mitchell, Percy Liang, Daniel E. Ho, Yejin Choi, Sanmi Koyejo, Fernando Delgado, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Solon Barocas, Amy Cyphert, Mark Lemley, et al. (10 additional authors not shown)

    Abstract: We articulate fundamental mismatches between technical methods for machine unlearning in Generative AI, and documented aspirations for broader impact that these methods could have for law and policy. These aspirations are both numerous and varied, motivated by issues that pertain to privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effect…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: Presented at the 2nd Workshop on Generative AI and Law at ICML (July 2024)

  13. arXiv:2412.00176  [pdf, other]

    cs.CV

    Opt-In Art: Learning Art Styles Only from Few Examples

    Authors: Hui Ren, Joanna Materzynska, Rohit Gandikota, David Bau, Antonio Torralba

    Abstract: We explore whether pre-training on datasets with paintings is necessary for a model to learn an artistic style with only a few examples. To investigate this, we train a text-to-image model exclusively on photographs, without access to any painting-related content. We show that it is possible to adapt a model that is trained without paintings to an artistic style, given only few examples. User stud…

    Submitted 20 May, 2025; v1 submitted 29 November, 2024; originally announced December 2024.

  14. arXiv:2410.22366  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models

    Authors: Viacheslav Surkov, Chris Wendler, Antonio Mari, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre, David Bau

    Abstract: For large language models (LLMs), sparse autoencoders (SAEs) have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn int…

    Submitted 22 June, 2025; v1 submitted 28 October, 2024; originally announced October 2024.

  15. arXiv:2410.02760  [pdf, other]

    cs.CL cs.LG

    Erasing Conceptual Knowledge from Language Models

    Authors: Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau

    Abstract: In this work, we propose Erasure of Language Memory (ELM), an approach for concept-level unlearning built on the principle of matching the distribution defined by an introspective classifier. Our key insight is that effective unlearning should leverage the model's ability to evaluate its own knowledge, using the model itself as a classifier to identify and reduce the likelihood of generating conte…

    Submitted 22 March, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

    Comments: Project Page: https://elm.baulab.info

  16. arXiv:2408.01416  [pdf, other]

    cs.LG cs.AI

    The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability

    Authors: Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov

    Abstract: Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the…

    Submitted 2 August, 2024; originally announced August 2024.

  17. arXiv:2408.00113  [pdf, other]

    cs.LG cs.AI cs.CL

    Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

    Authors: Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks

    Abstract: What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the quality of these SAEs is difficult because we lack a ground-truth collection of interpretable features that we expect good SAEs to recover. We thus propose to me…

    Submitted 30 October, 2024; v1 submitted 31 July, 2024; originally announced August 2024.

    Comments: Accepted as an oral paper (top 5%) at the ICML 2024 Mechanistic Interpretability Workshop and to the NeurIPS 2024 Main Conference

  18. arXiv:2407.14981  [pdf, other]

    cs.CY

    Open Problems in Technical AI Governance

    Authors: Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Alexandra Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, David Bau, Paul Bricman, et al. (8 additional authors not shown)

    Abstract: AI progress is creating a growing range of risks and opportunities, but it is often unclear how they should be navigated. In many cases, the barriers and uncertainties faced are at least partly technical. Technical AI governance, referring to technical analysis and tools for supporting the effective governance of AI, seeks to address such challenges. It can help to (a) identify areas where interve…

    Submitted 16 April, 2025; v1 submitted 20 July, 2024; originally announced July 2024.

    Comments: Ben Bucknall and Anka Reuel contributed equally and share the first author position

    Journal ref: Transactions on Machine Learning Research, 2025

  19. arXiv:2407.14561  [pdf, other]

    cs.LG cs.AI

    NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

    Authors: Jaden Fiotto-Kaufman, Alexander R. Loftus, Eric Todd, Jannik Brinkmann, Koyena Pal, Dmitrii Troitskii, Michael Ripa, Adam Belfki, Can Rager, Caden Juang, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Nikhil Prakash, Carla Brodley, Arjun Guha, Jonathan Bell, Byron C. Wallace, David Bau

    Abstract: We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. The National Deep Inference Fabric (NDIF) is a scalable inference service that executes NNsight requests, allowing users to share GPU re…

    Submitted 1 April, 2025; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Code at https://nnsight.net

  20. arXiv:2406.20086  [pdf, other]

    cs.CL cs.LG

    Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

    Authors: Sheridan Feucht, David Atkinson, Byron Wallace, David Bau

    Abstract: LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantical…

    Submitted 11 October, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

    Comments: 13 pages, 14 figures. Code and data at https://footprints.baulab.info/

    ACM Class: I.2.7

  21. arXiv:2405.01536  [pdf, other]

    cs.CV cs.GR cs.LG

    Customizing Text-to-Image Models with a Single Image Pair

    Authors: Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan Zhu

    Abstract: Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then appli…

    Submitted 28 October, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: project page: https://paircustomization.github.io/

  22. arXiv:2404.03646  [pdf, other]

    cs.CL

    Locating and Editing Factual Associations in Mamba

    Authors: Arnab Sen Sharma, David Atkinson, David Bau

    Abstract: We investigate the mechanisms of factual recall in the Mamba state space model. Our work is inspired by previous findings in autoregressive transformer language models suggesting that their knowledge recall is localized to particular modules at specific token locations; we therefore ask whether factual recall in Mamba can be similarly localized. To investigate this, we conduct four lines of experi…

    Submitted 2 August, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: 18 pages, COLM-2024

  23. arXiv:2403.19647  [pdf, other]

    cs.LG cs.AI cs.CL

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Authors: Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

    Abstract: We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse featur…

    Submitted 27 March, 2025; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: Code and data at https://github.com/saprmarks/feature-circuits. Demonstration at https://feature-circuits.xyz

    Journal ref: International Conference on Learning Representations, 2025

  24. arXiv:2403.02327  [pdf, other]

    cs.DB cs.AI

    Model Lakes

    Authors: Koyena Pal, David Bau, Renée J. Miller

    Abstract: Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models are different one from another. Currently, practitioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of models increases, the challenges of finding, di…

    Submitted 21 February, 2025; v1 submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted to EDBT 2025

  25. arXiv:2402.14811  [pdf, other]

    cs.CL cs.LG

    Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking

    Authors: Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau

    Abstract: Fine-tuning on generalized tasks such as instruction following, code generation, and mathematics has been shown to enhance language models' performance on a range of tasks. Nevertheless, explanations of how such fine-tuning influences the internal computations in these models remain elusive. We study how fine-tuning affects the internal mechanisms implemented in language models. As a case study, w…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: ICLR 2024. 26 pages, 13 figures. Code and data at https://finetuning.baulab.info/

  26. arXiv:2402.10962  [pdf, other]

    cs.CL cs.AI cs.LG

    Measuring and Controlling Instruction (In)Stability in Language Model Dialogs

    Authors: Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg

    Abstract: System-prompting is a standard tool for customizing language-model chatbots, enabling them to follow a specific instruction. An implicit assumption in the use of system prompts is that they will be stable, so the chatbot will continue to generate text according to the stipulated instructions for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating…

    Submitted 25 July, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

    Comments: COLM 2024; Code and data: https://github.com/likenneth/persona_drift

  27. arXiv:2401.14446  [pdf, other]

    cs.CY cs.AI cs.CR

    Black-Box Access is Insufficient for Rigorous AI Audits

    Authors: Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell

    Abstract: External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workin…

    Submitted 29 May, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: FAccT 2024

    Journal ref: The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24), June 3-6, 2024, Rio de Janeiro, Brazil

  28. arXiv:2311.12092  [pdf, other]

    cs.CV

    Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

    Authors: Rohit Gandikota, Joanna Materzynska, Tingrui Zhou, Antonio Torralba, David Bau

    Abstract: We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. Our approach identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes. A slider is created using a small set of prompts or sample images; thus slider directions can be created for either…

    Submitted 27 November, 2023; v1 submitted 20 November, 2023; originally announced November 2023.

  29. arXiv:2311.11350  [pdf, ps, other]

    cs.CY

    An Alternative to Regulation: The Case for Public AI

    Authors: Nicholas Vincent, David Bau, Sarah Schwettmann, Joshua Tan

    Abstract: Can governments build AI? In this paper, we describe an ongoing effort to develop "public AI": publicly accessible AI models funded, provisioned, and governed by governments or other public bodies. Public AI presents both an alternative and a complement to standard regulatory approaches to AI, but it also suggests new technical and policy challenges. We present a roadmap for how the ML researc…

    Submitted 19 November, 2023; originally announced November 2023.

    Comments: To be presented at Regulatable ML @ NeurIPS2023 workshop

  30. arXiv:2311.10538  [pdf, other]

    cs.AI

    Testing Language Model Agents Safely in the Wild

    Authors: Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau

    Abstract: A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, both due to the possibility of causing harm during a test, as well as the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tes…

    Submitted 3 December, 2023; v1 submitted 17 November, 2023; originally announced November 2023.

  31. Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

    Authors: Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau

    Abstract: We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear appr…

    Submitted 8 November, 2023; originally announced November 2023.

    Comments: Accepted at CoNLL 2023

  32. arXiv:2310.15213  [pdf, other]

    cs.CL cs.LG

    Function Vectors in Large Language Models

    Authors: Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau

    Abstract: We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number of attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are…

    Submitted 25 February, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: ICLR 2024. 52 pages, 30 figures, 23 tables. Code and data at https://functions.baulab.info

  33. arXiv:2309.03886  [pdf, other]

    cs.CL cs.AI cs.LG

    FIND: A Function Description Benchmark for Evaluating Interpretability Methods

    Authors: Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba

    Abstract: Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable…

    Submitted 8 December, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: 28 pages, 10 figures

    Journal ref: NeurIPS 2023

  34. arXiv:2308.14761  [pdf, other]

    cs.CV cs.LG

    Unified Concept Editing in Diffusion Models

    Authors: Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau

    Abstract: Text-to-image models suffer from various safety issues that may limit their suitability for deployment. Previous methods have separately addressed individual issues of bias, copyright, and offensive content in text-to-image models. However, in the real world, all of these issues appear simultaneously in the same model. We present a method that tackles all issues with a single approach. Our method,…

    Submitted 22 October, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

    Comments: In proceedings of WACV 2024. Project Page: https://unified.baulab.info

  35. arXiv:2308.09124  [pdf, other]

    cs.CL

    Linearity of Relation Decoding in Transformer Language Models

    Authors: Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau

    Abstract: Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a fir…

    Submitted 15 February, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

  36. arXiv:2308.01544  [pdf, other]

    cs.CV cs.CL

    Multimodal Neurons in Pretrained Text-Only Transformers

    Authors: Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, Antonio Torralba

    Abstract: Language models demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection lay…

    Submitted 1 October, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

    Comments: Oral presentation at ICCV CLVL 2023

  37. arXiv:2307.03637  [pdf, other]

    cs.AI

    Discovering Variable Binding Circuitry with Desiderata

    Authors: Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

    Abstract: Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of \textit{deside…

    Submitted 7 July, 2023; originally announced July 2023.

  38. arXiv:2303.07345  [pdf, other]

    cs.CV

    Erasing Concepts from Diffusion Models

    Authors: Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, David Bau

    Abstract: Motivated by recent advancements in text-to-image diffusion, we study erasure of specific concepts from the model's weights. While Stable Diffusion has shown promise in producing explicit or realistic artwork, it has raised concerns regarding its potential for misuse. We propose a fine-tuning method that can erase a visual concept from a pre-trained diffusion model, given only the name of the styl…

    Submitted 20 June, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

  39. arXiv:2210.13382  [pdf, other]

    cs.LG cs.AI cs.CL

    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task

    Authors: Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg

    Abstract: Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple boa…

    Submitted 26 June, 2024; v1 submitted 24 October, 2022; originally announced October 2022.

    Comments: ICLR 2023 oral (notable-top-5%): https://openreview.net/forum?id=DeG07_TcZvT ; code: https://github.com/likenneth/othello_world

  40. arXiv:2210.07229  [pdf, other]

    cs.CL cs.LG

    Mass-Editing Memory in a Transformer

    Authors: Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, David Bau

    Abstract: Recent work has shown exciting promise in updating large language models with new memories, so as to replace obsolete information or add specialized knowledge. However, this line of work is predominantly limited to updating single associations. We develop MEMIT, a method for directly updating a language model with many memories, demonstrating experimentally that it can scale up to thousands of ass…

    Submitted 1 August, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: 18 pages, 11 figures. Code and data at https://memit.baulab.info

  41. arXiv:2210.03116  [pdf, other]

    cs.CV cs.GR cs.IR cs.LG

    Content-Based Search for Deep Generative Models

    Authors: Daohan Lu, Sheng-Yu Wang, Nupur Kumari, Rohan Agarwal, Mia Tang, David Bau, Jun-Yan Zhu

    Abstract: The growing proliferation of customized and pretrained generative models has made it infeasible for a user to be fully cognizant of every model in existence. To address this need, we introduce the task of content-based model search: given a query and a large set of generative models, finding the models that best match the query. As each generative model produces a distribution of images, we formul…

    Submitted 24 October, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: Our project page is hosted at https://generative-intelligence-lab.github.io/modelverse/

  42. arXiv:2207.14288  [pdf, other]

    cs.CV

    Rewriting Geometric Rules of a GAN

    Authors: Sheng-Yu Wang, David Bau, Jun-Yan Zhu

    Abstract: Deep generative models make visual content creation more accessible to novice users by automating the synthesis of diverse, realistic content based on a collected dataset. However, the current machine learning approaches miss a key element of the creative process -- the ability to synthesize things that go far beyond the data distribution and everyday experience. To begin to address this issue, we…

    Submitted 28 July, 2022; originally announced July 2022.

    Comments: SIGGRAPH 2022. Website: https://peterwang512.github.io/GANWarping/ Code: https://github.com/PeterWang512/GANWarping

  43. arXiv:2207.02774  [pdf, other]

    cs.CV cs.GR

    Local Relighting of Real Scenes

    Authors: Audrey Cui, Ali Jahanian, Agata Lapedriza, Antonio Torralba, Shahin Mahdizadehaghdam, Rohit Kumar, David Bau

    Abstract: We introduce the task of local relighting, which changes a photograph of a scene by switching on and off the light sources that are visible within the image. This new task differs from the traditional image relighting problem, as it introduces the challenge of detecting light sources and inferring the pattern of light that emanates from them. We propose an approach for local relighting that trains…

    Submitted 6 July, 2022; originally announced July 2022.

    Comments: 15 pages, 15 figures

  44. arXiv:2206.07835  [pdf, other]

    cs.CV

    Disentangling visual and written concepts in CLIP

    Authors: Joanna Materzynska, Antonio Torralba, David Bau

    Abstract: The CLIP network measures the similarity between natural text and images; in this work, we investigate the entanglement of the representation of word images and natural images in its image encoder. First, we find that the image encoder has an ability to match word images with natural images of scenes described by those words. This is consistent with previous research that suggests that the meaning…

    Submitted 15 June, 2022; originally announced June 2022.

  45. arXiv:2202.05262  [pdf, other]

    cs.CL cs.LG

    Locating and Editing Factual Associations in GPT

    Authors: Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov

    Abstract: We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modul…

    Submitted 13 January, 2023; v1 submitted 10 February, 2022; originally announced February 2022.

    Comments: NeurIPS 2022. 35 pages, 30 figures. Code and data at https://rome.baulab.info/

    ACM Class: I.2.7

  46. arXiv:2201.11114  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Natural Language Descriptions of Deep Visual Features

    Authors: Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas

    Abstract: Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a rich…

    Submitted 18 April, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: To be published as a conference paper at ICLR 2022

  47. arXiv:2112.01008  [pdf, other]

    cs.LG cs.CV

    Editing a classifier by rewriting its prediction rules

    Authors: Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, Aleksander Madry

    Abstract: We present a methodology for modifying the behavior of a classifier by directly rewriting its prediction rules. Our approach requires virtually no additional data collection and can be applied to a variety of settings, including adapting a model to new environments, and modifying it to ignore spurious features. Our code is available at https://github.com/MadryLab/EditingClassifiers.

    Submitted 2 December, 2021; originally announced December 2021.

  48. arXiv:2110.04292  [pdf, other]

    cs.CV cs.AI

    Toward a Visual Concept Vocabulary for GAN Latent Space

    Authors: Sarah Schwettmann, Evan Hernandez, David Bau, Samuel Klein, Jacob Andreas, Antonio Torralba

    Abstract: A large body of recent work has identified transformations in the latent spaces of generative adversarial networks (GANs) that consistently and interpretably transform generated images. But existing techniques for identifying these transformations rely on either a fixed vocabulary of pre-specified visual concepts, or on unsupervised disentanglement techniques whose alignment with human judgments a…

    Submitted 8 October, 2021; originally announced October 2021.

    Comments: 15 pages, 13 figures. Accepted to ICCV 2021. Project page: https://visualvocab.csail.mit.edu

    ACM Class: I.4

  49. arXiv:2108.02774  [pdf, other]

    cs.CV cs.LG

    Sketch Your Own GAN

    Authors: Sheng-Yu Wang, David Bau, Jun-Yan Zhu

    Abstract: Can a user create a deep generative model by sketching a single example? Traditionally, creating a GAN model has required the collection of a large-scale dataset of exemplars and specialized knowledge in deep learning. In contrast, sketching is possibly the most universally accessible way to convey a visual concept. In this work, we present a method, GAN Sketching, for rewriting GANs with one or m…

    Submitted 20 September, 2021; v1 submitted 5 August, 2021; originally announced August 2021.

    Comments: ICCV 2021. Website: https://peterwang512.github.io/GANSketching Code: https://github.com/PeterWang512/GANSketching

  50. arXiv:2103.10951  [pdf, other]

    cs.CV cs.AI cs.GR

    Paint by Word

    Authors: Alex Andonian, Sabrina Osmany, Audrey Cui, YeonHwan Park, Ali Jahanian, Antonio Torralba, David Bau

    Abstract: We investigate the problem of zero-shot semantic image painting. Instead of painting modifications into an image using only concrete colors or a finite set of semantic concepts, we ask how to create semantic paint based on open full-text descriptions: our goal is to be able to point to a location in a synthesized image and apply an arbitrary new concept such as "rustic" or "opulent" or "happy dog.…

    Submitted 23 March, 2023; v1 submitted 19 March, 2021; originally announced March 2021.

    Comments: 10 pages, 9 figures

    ACM Class: I.2.10; I.4; I.3