-
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Authors:
Yaniv Nikankin,
Dana Arad,
Yossi Gandelsman,
Yonatan Belinkov
Abstract:
Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
Submitted 11 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
SAEs Are Good for Steering -- If You Select the Right Features
Authors:
Dana Arad,
Aaron Mueller,
Yonatan Belinkov
Abstract:
Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept - without requiring labeled data. Current methods identify SAE features to steer by analyzing the input tokens that activate them. However, recent work has highlighted that activations alone do not fully describe the effect of a feature on the model's output. In this work, we draw a distinction between two types of features: input features, which mainly capture patterns in the model's input, and output features, which have a human-understandable effect on the model's output. We propose input and output scores to characterize and locate these types of features, and show that high values for both scores rarely co-occur in the same features. These findings have practical implications: after filtering out features with low output scores, we obtain 2-3x improvements when steering with SAEs, making them competitive with supervised methods.
Submitted 26 May, 2025;
originally announced May 2025.
-
MIB: A Mechanistic Interpretability Benchmark
Authors:
Aaron Mueller,
Atticus Geiger,
Sarah Wiegreffe,
Dana Arad,
Iván Arcuschin,
Adam Belfki,
Yik Siu Chan,
Jaden Fiotto-Kaufman,
Tal Haklay,
Michael Hanna,
Jing Huang,
Rohan Gupta,
Yaniv Nikankin,
Hadas Orgad,
Nikhil Prakash,
Anja Reusch,
Aruna Sankaranarayanan,
Shun Shao,
Alessandro Stolfo,
Martin Tutek,
Amir Zur,
David Bau,
Yonatan Belinkov
Abstract:
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons, and increases our confidence that there has been real progress in the field.
Submitted 9 June, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
Authors:
Michael Toker,
Hadas Orgad,
Mor Ventura,
Dana Arad,
Yonatan Belinkov
Abstract:
Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the process by which the encoder produces the text representation is unknown. We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the Diffusion Lens, we perform an extensive analysis of two recent T2I models. Exploring compound prompts, we find that complex scenes describing multiple objects are composed progressively and more slowly compared to simple scenes; exploring knowledge retrieval, we find that representation of uncommon concepts requires further computation compared to common concepts, and that knowledge retrieval is gradual across layers. Overall, our findings provide valuable insights into the text encoder component in T2I pipelines.
Submitted 21 October, 2024; v1 submitted 9 March, 2024;
originally announced March 2024.
-
ReFACT: Updating Text-to-Image Models by Editing the Text Encoder
Authors:
Dana Arad,
Hadas Orgad,
Yonatan Belinkov
Abstract:
Our world is marked by unprecedented technological, global, and socio-political transformations, posing a significant challenge to text-to-image generative models. These models encode factual associations within their parameters that can quickly become outdated, diminishing their utility for end-users. To that end, we introduce ReFACT, a novel approach for editing factual associations in text-to-image models without relying on explicit input from end-users or costly re-training. ReFACT updates the weights of a specific layer in the text encoder, modifying only a tiny portion of the model's parameters and leaving the rest of the model unaffected. We empirically evaluate ReFACT on an existing benchmark, alongside a newly curated dataset. Compared to other methods, ReFACT achieves superior performance in both generalization to related concepts and preservation of unrelated concepts. Furthermore, ReFACT maintains image generation quality, making it a practical tool for updating and correcting factual information in text-to-image models.
Submitted 7 May, 2024; v1 submitted 1 June, 2023;
originally announced June 2023.
-
Tighter Bounds for Makespan Minimization on Unrelated Machines
Authors:
Dor Arad,
Yael Mordechai,
Hadas Shachnai
Abstract:
We consider the problem of scheduling $n$ jobs to minimize the makespan on $m$ unrelated machines, where job $j$ requires time $p_{ij}$ if processed on machine $i$. A classic algorithm of Lenstra et al. yields the best known approximation ratio of $2$ for the problem. Improving this bound has been a prominent open problem for over two decades. In this paper we obtain a tighter bound for a wide subclass of instances which can be identified efficiently. Specifically, we define the feasibility factor of a given instance as the minimum fraction of machines on which each job can be processed. We show that there is a polynomial-time algorithm that, given values $L$ and $T$, and an instance having a sufficiently large feasibility factor $h \in (0,1]$, either proves that no schedule of mean machine completion time $L$ and makespan $T$ exists, or else finds a schedule of makespan at most $T + L/h < 2T$. For the restricted version of the problem, where for each job $j$ and machine $i$, $p_{ij} \in \{p_j, \infty\}$, we show that a simpler algorithm yields a better bound, thus improving for highly feasible instances the best known ratio of $33/17 + \epsilon$, for any fixed $\epsilon > 0$, due to Svensson.
Submitted 23 June, 2014; v1 submitted 11 May, 2014;
originally announced May 2014.
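The feasibility factor defined in the makespan abstract above (the minimum, over jobs, of the fraction of machines on which the job can be processed) can be illustrated with a small sketch. This is not the paper's algorithm: the function names, the matrix layout `p[i][j]`, the use of `math.inf` to mark a machine that cannot run a job, and the toy instance are all assumptions made for illustration. The check `h > L / T` is only the arithmetic rearrangement of the stated inequality $T + L/h < 2T$, not necessarily the paper's own condition for "sufficiently large".

```python
import math

INF = math.inf  # assumption: an infinite time marks "job cannot run on this machine"


def feasibility_factor(p):
    """Minimum over jobs of the fraction of machines that can process the job.

    p[i][j] is the processing time of job j on machine i (hypothetical layout).
    """
    m = len(p)      # number of machines
    n = len(p[0])   # number of jobs
    return min(
        sum(1 for i in range(m) if math.isfinite(p[i][j])) / m
        for j in range(n)
    )


def makespan_bound(T, L, h):
    """The quantity T + L/h quoted in the abstract as the makespan guarantee."""
    return T + L / h


# Toy instance (made up): 4 machines, 3 jobs, each job runs on 3 of 4 machines.
p = [
    [2.0, INF, 1.0],
    [2.0, 3.0, INF],
    [INF, 3.0, 1.0],
    [2.0, 3.0, 1.0],
]

h = feasibility_factor(p)
T, L = 4.0, 2.0  # illustrative target makespan and mean machine completion time
bound = makespan_bound(T, L, h)

# T + L/h < 2T holds exactly when h > L/T (for T > 0 and h > 0).
beats_factor_two = h > L / T
```

On this toy instance every job is feasible on three of the four machines, so `h` is 0.75, which exceeds `L / T = 0.5`, and the resulting bound stays below `2 * T`.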