
Showing 1–12 of 12 results for author: Orgad, H

  1. arXiv:2504.13151  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    MIB: A Mechanistic Interpretability Benchmark

    Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

    Abstract: How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization…

    Submitted 9 June, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Accepted to ICML 2025. Project website at https://mib-bench.github.io

  2. arXiv:2503.15299  [pdf, other]

    cs.CL

    Inside-Out: Hidden Factual Knowledge in LLMs

    Authors: Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart

    Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect…

    Submitted 23 March, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

  3. arXiv:2502.04577  [pdf, other]

    cs.LG cs.CL

    Position-aware Automatic Circuit Discovery

    Authors: Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov

    Abstract: A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model's computation graph that executes a specific task. We identify a gap in existing circuit discovery methods: they assume circuits are position-invariant, treating model components as equally relevant across input positions. This limits their ability to capture…

    Submitted 6 February, 2025; originally announced February 2025.

    MSC Class: 68T50 ACM Class: I.2.7

  4. arXiv:2501.06751  [pdf, other]

    cs.CL cs.CV

    Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models

    Authors: Michael Toker, Ido Galil, Hadas Orgad, Rinon Gal, Yoad Tewel, Gal Chechik, Yonatan Belinkov

    Abstract: Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. Typically, these prompts are extended to a fixed length by adding padding tokens before text encoding. Despite being a default practice, the influence of padding tokens on the image generation process has not been investigated. In this work, we conduct the first in-depth analysis of the role padding…

    Submitted 2 March, 2025; v1 submitted 12 January, 2025; originally announced January 2025.

    Comments: Published in: NAACL 2025. Project webpage: https://padding-tone.github.io/

  5. arXiv:2410.02707  [pdf, other]

    cs.CL cs.AI

    LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

    Authors: Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov

    Abstract: Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations…

    Submitted 18 May, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

    MSC Class: 68T50 ACM Class: I.2.7

  6. Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines

    Authors: Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov

    Abstract: Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the process by which the encoder produces the text representation is unknown. We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the Diffusion Lens, we perform an extensi…

    Submitted 21 October, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

    Comments: Published in: ACL 2024. Project webpage: tokeron.github.io/DiffusionLensWeb

    ACM Class: I.2.7; I.4.0

  7. arXiv:2308.14761  [pdf, other]

    cs.CV cs.LG

    Unified Concept Editing in Diffusion Models

    Authors: Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau

    Abstract: Text-to-image models suffer from various safety issues that may limit their suitability for deployment. Previous methods have separately addressed individual issues of bias, copyright, and offensive content in text-to-image models. However, in the real world, all of these issues appear simultaneously in the same model. We present a method that tackles all issues with a single approach. Our method,…

    Submitted 22 October, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

    Comments: In proceedings of WACV 2024. Project Page: https://unified.baulab.info

  8. arXiv:2306.00738  [pdf, other]

    cs.CL cs.CV

    ReFACT: Updating Text-to-Image Models by Editing the Text Encoder

    Authors: Dana Arad, Hadas Orgad, Yonatan Belinkov

    Abstract: Our world is marked by unprecedented technological, global, and socio-political transformations, posing a significant challenge to text-to-image generative models. These models encode factual associations within their parameters that can quickly become outdated, diminishing their utility for end-users. To that end, we introduce ReFACT, a novel approach for editing factual associations in text-to-i…

    Submitted 7 May, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to NAACL 2024 (Main Conference)

    MSC Class: 68T50 ACM Class: I.2.7

  9. arXiv:2303.08084  [pdf, other]

    cs.CV

    Editing Implicit Assumptions in Text-to-Image Diffusion Models

    Authors: Hadas Orgad, Bahjat Kawar, Yonatan Belinkov

    Abstract: Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edi…

    Submitted 25 August, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: Project page: https://time-diffusion.github.io/

  10. arXiv:2212.10563  [pdf, other]

    cs.CL

    BLIND: Bias Removal With No Demographics

    Authors: Hadas Orgad, Yonatan Belinkov

    Abstract: Models trained on real-world data tend to imitate and amplify social biases. Common methods to mitigate biases require prior information on the types of biases that should be mitigated (e.g., gender or racial bias) and the social groups associated with each data sample. In this work, we introduce BLIND, a method for bias removal with no prior knowledge of the demographics in the dataset. While tra…

    Submitted 11 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL 2023 main conference

    MSC Class: 68T50 ACM Class: I.2.7

  11. arXiv:2210.11471  [pdf, other]

    cs.CL

    Choose Your Lenses: Flaws in Gender Bias Evaluation

    Authors: Hadas Orgad, Yonatan Belinkov

    Abstract: Considerable efforts to measure and mitigate gender bias in recent years have led to the introduction of an abundance of tasks, datasets, and metrics used in this vein. In this position paper, we assess the current paradigm of gender bias evaluation and identify several flaws in it. First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is…

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: Accepted to the 4th Workshop on Gender Bias in Natural Language Processing

    MSC Class: 68T50 ACM Class: I.2.7

  12. arXiv:2204.06827  [pdf, other]

    cs.CL

    How Gender Debiasing Affects Internal Model Representations, and Why It Matters

    Authors: Hadas Orgad, Seraphina Goldfarb-Tarrant, Yonatan Belinkov

    Abstract: Common studies of gender bias in NLP focus either on extrinsic bias measured by model performance on a downstream task or on intrinsic bias found in models' internal representations. However, the relationship between extrinsic and intrinsic bias is relatively unknown. In this work, we illuminate this relationship by measuring both quantities together: we debias a model during downstream fine-tunin…

    Submitted 16 May, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

    Comments: Accepted to NAACL 2022

    MSC Class: 68T50 ACM Class: I.2.7