-
Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders
Authors:
Matthew Lyle Olson,
Musashi Hinck,
Neale Ratzlaff,
Changbai Li,
Phillip Howard,
Vasudev Lal,
Shao-Yen Tseng
Abstract:
The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanati…
▽ More
The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. Here, we extend their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. Our results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. We analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. Our study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
Probing Semantic Routing in Large Mixture-of-Expert Models
Authors:
Matthew Lyle Olson,
Neale Ratzlaff,
Musashi Hinck,
Man Luo,
Sungduk Yu,
Chendi Xue,
Vasudev Lal
Abstract:
In the past year, large (>100B parameter) mixture-of-expert (MoE) models have become increasingly common in the open domain. While their advantages are often framed in terms of efficiency, prior work has also explored functional differentiation through routing behavior. We investigate whether expert routing in large MoE models is influenced by the semantics of the inputs. To test this, we design t…
▽ More
In the past year, large (>100B parameter) mixture-of-expert (MoE) models have become increasingly common in the open domain. While their advantages are often framed in terms of efficiency, prior work has also explored functional differentiation through routing behavior. We investigate whether expert routing in large MoE models is influenced by the semantics of the inputs. To test this, we design two controlled experiments. First, we compare activations on sentence pairs with a shared target word used in the same or different senses. Second, we fix context and substitute the target word with semantically similar or dissimilar alternatives. Comparing expert overlap across these conditions reveals clear, statistically significant evidence of semantic routing in large MoE models.
△ Less
Submitted 21 May, 2025; v1 submitted 15 February, 2025;
originally announced February 2025.
-
IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance
Authors:
Paul Röttger,
Musashi Hinck,
Valentin Hofmann,
Kobi Hackenburg,
Valentina Pyatkin,
Faeze Brahman,
Dirk Hovy
Abstract:
Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs ac…
▽ More
Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. "write a blog about") and 212 political issues (e.g. "AI regulation") from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Steering Large Language Models to Evaluate and Amplify Creativity
Authors:
Matthew Lyle Olson,
Neale Ratzlaff,
Musashi Hinck,
Shao-yen Tseng,
Vasudev Lal
Abstract:
Although capable of generating creative text, Large Language Models (LLMs) are poor judges of what constitutes "creativity". In this work, we show that we can leverage this knowledge of how to write creatively in order to better judge what is creative. We take a mechanistic approach that extracts differences in the internal states of an LLM when prompted to respond "boringly" or "creatively" to pr…
▽ More
Although capable of generating creative text, Large Language Models (LLMs) are poor judges of what constitutes "creativity". In this work, we show that we can leverage this knowledge of how to write creatively in order to better judge what is creative. We take a mechanistic approach that extracts differences in the internal states of an LLM when prompted to respond "boringly" or "creatively" to provide a robust measure of creativity that corresponds strongly with human judgment. We also show these internal state differences can be applied to enhance the creativity of generated text at inference time.
△ Less
Submitted 8 December, 2024;
originally announced December 2024.
-
Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering
Authors:
Neale Ratzlaff,
Matthew Lyle Olson,
Musashi Hinck,
Estelle Aflalo,
Shao-Yen Tseng,
Vasudev Lal,
Phillip Howard
Abstract:
Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as general-purpose chatbots able to engage in conversations about visual inputs. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we p…
▽ More
Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as general-purpose chatbots able to engage in conversations about visual inputs. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a training-free debiasing framework for LMMs that intervenes on the model's representations during text generation by constructing a steering vector that reduces reference on protected attributes. Our framework introduces two complementary methods: (1) a dataset-based approach that constructs a steering vector by contrasting model activations on biased and neutral inputs, and (2) a novel optimization-based approach designed for low-resource settings, which constructs the steering vector using a single step of gradient-based perturbation without requiring additional data. Our experiments show that these interventions effectively reduce the propensity of LMMs to generate text related to protected attributes while maintaining sentiment and fluency. Furthermore, we demonstrate that debiased LMMs achieve comparable accuracy to their unmodified counterparts on downstream tasks, indicating that bias mitigation can be achieved without sacrificing model performance.
△ Less
Submitted 13 March, 2025; v1 submitted 15 November, 2024;
originally announced November 2024.
-
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
Authors:
Neale Ratzlaff,
Matthew Lyle Olson,
Musashi Hinck,
Shao-Yen Tseng,
Vasudev Lal,
Phillip Howard
Abstract:
Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different dem…
▽ More
Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a novel debiasing framework for LVLMs by directly ablating biased attributes during text generation to avoid generating text related to protected attributes, or even representing them internally. Our method requires no training and a relatively small amount of representative biased outputs (~1000 samples). Our experiments show that not only can we can minimize the propensity of LVLMs to generate text related to protected attributes, but we can even use synthetic data to inform the ablation while retaining captioning performance on real data such as COCO. Furthermore, we find the resulting generations from a debiased LVLM exhibit similar accuracy as a baseline biased model, showing that debiasing effects can be achieved without sacrificing model performance.
△ Less
Submitted 17 October, 2024;
originally announced October 2024.
-
AutoPersuade: A Framework for Evaluating and Explaining Persuasive Arguments
Authors:
Till Raphael Saenger,
Musashi Hinck,
Justin Grimmer,
Brandon M. Stewart
Abstract:
We introduce AutoPersuade, a three-part framework for constructing persuasive messages. First, we curate a large dataset of arguments with human evaluations. Next, we develop a novel topic model to identify argument features that influence persuasiveness. Finally, we use this model to predict the effectiveness of new arguments and assess the causal impact of different components to provide explana…
▽ More
We introduce AutoPersuade, a three-part framework for constructing persuasive messages. First, we curate a large dataset of arguments with human evaluations. Next, we develop a novel topic model to identify argument features that influence persuasiveness. Finally, we use this model to predict the effectiveness of new arguments and assess the causal impact of different components to provide explanations. We validate AutoPersuade through an experimental study on arguments for veganism, demonstrating its effectiveness with human studies and out-of-sample predictions.
△ Less
Submitted 11 March, 2025; v1 submitted 11 October, 2024;
originally announced October 2024.
-
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
Authors:
Sungduk Yu,
Brian L. White,
Anahita Bhiwandiwalla,
Musashi Hinck,
Matthew Lyle Olson,
Yaniv Gurwicz,
Raanan Y. Rohekar,
Tung Nguyen,
Vasudev Lal
Abstract:
Detecting and attributing temperature increases driven by climate change is crucial for understanding global warming and informing adaptation strategies. However, distinguishing human-induced climate signals from natural variability remains challenging for traditional detection and attribution (D&A) methods, which rely on identifying specific "fingerprints" -- spatial patterns expected to emerge f…
▽ More
Detecting and attributing temperature increases driven by climate change is crucial for understanding global warming and informing adaptation strategies. However, distinguishing human-induced climate signals from natural variability remains challenging for traditional detection and attribution (D&A) methods, which rely on identifying specific "fingerprints" -- spatial patterns expected to emerge from external forcings such as greenhouse gas emissions. Deep learning offers promise in discerning these complex patterns within expansive spatial datasets, yet the lack of standardized protocols has hindered consistent comparisons across studies.
To address this gap, we introduce ClimDetect, a standardized dataset comprising 1.17M daily climate snapshots paired with target climate change indicator variables. The dataset is curated from both CMIP6 climate model simulations and real-world observation-assimilated reanalysis datasets (ERA5, JRA-3Q, and MERRA-2), and is designed to enhance model accuracy in detecting climate change signals. ClimDetect integrates various input and target variables used in previous research, ensuring comparability and consistency across studies. We also explore the application of vision transformers (ViT) to climate data -- a novel approach that, to our knowledge, has not been attempted before for climate change detection tasks. Our open-access data serve as a benchmark for advancing climate science by enabling end-to-end model development and evaluation. ClimDetect is publicly accessible via Hugging Face dataset repository at: https://huggingface.co/datasets/ClimDetect/ClimDetect.
△ Less
Submitted 10 March, 2025; v1 submitted 28 August, 2024;
originally announced August 2024.
-
Why do LLaVA Vision-Language Models Reply to Images in English?
Authors:
Musashi Hinck,
Carolin Holtermann,
Matthew Lyle Olson,
Florian Schneider,
Sungduk Yu,
Anahita Bhiwandiwalla,
Anne Lauscher,
Shaoyen Tseng,
Vasudev Lal
Abstract:
We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query. This paper investigates the causes of this loss with a two-pronged approach that combines extensive ablatio…
▽ More
We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query. This paper investigates the causes of this loss with a two-pronged approach that combines extensive ablation of the design space with a mechanistic analysis of the models' internal representations of image and text inputs. Both approaches indicate that the issue stems in the language modelling component of the LLaVA model. Statistically, we find that switching the language backbone for a bilingual language model has the strongest effect on reducing this error. Mechanistically, we provide compelling evidence that visual inputs are not mapped to a similar space as text ones, and that intervening on intermediary attention layers can reduce this bias. Our findings provide important insights to researchers and engineers seeking to understand the crossover between multimodal and multilingual spaces, and contribute to the goal of developing capable and inclusive VLMs for non-English contexts.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Authors:
Musashi Hinck,
Matthew L. Olson,
David Cobbley,
Shao-Yen Tseng,
Vasudev Lal
Abstract:
We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pre…
▽ More
We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed effects; skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code and weights for our models for the LLaVA-Gemma models.
△ Less
Submitted 10 June, 2024; v1 submitted 29 March, 2024;
originally announced April 2024.
-
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
Authors:
Paul Röttger,
Valentin Hofmann,
Valentina Pyatkin,
Musashi Hinck,
Hannah Rose Kirk,
Hinrich Schütze,
Dirk Hovy
Abstract:
Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificial…
▽ More
Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
△ Less
Submitted 5 June, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models
Authors:
Naoki Egami,
Musashi Hinck,
Brandon M. Stewart,
Hanying Wei
Abstract:
In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then explain labels using interpretable regression analyses in the second step. One increasingly common way to annotate documents cheaply at scale is through large language models (LLMs). However, like other scalabl…
▽ More
In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then explain labels using interpretable regression analyses in the second step. One increasingly common way to annotate documents cheaply at scale is through large language models (LLMs). However, like other scalable ways of producing annotations, such surrogate labels are often imperfect and biased. We present a new algorithm for using imperfect annotation surrogates for downstream statistical analyses while guaranteeing statistical properties -- like asymptotic unbiasedness and proper uncertainty quantification -- which are fundamental to CSS research. We show that direct use of surrogate labels in downstream statistical analyses leads to substantial bias and invalid confidence intervals, even with high surrogate accuracy of 80-90%. To address this, we build on debiased machine learning to propose the design-based supervised learning (DSL) estimator. DSL employs a doubly-robust procedure to combine surrogate labels with a smaller number of high-quality, gold-standard labels. Our approach guarantees valid inference for downstream statistical analyses, even when surrogates are arbitrarily biased and without requiring stringent assumptions, by controlling the probability of sampling documents for gold-standard labeling. Both our theoretical analysis and experimental results show that DSL provides valid statistical inference while achieving root mean squared errors comparable to existing alternatives that focus only on prediction without inferential guarantees.
△ Less
Submitted 14 January, 2024; v1 submitted 7 June, 2023;
originally announced June 2023.