-
Measuring and Guiding Monosemanticity
Authors:
Ruben Härle,
Felix Friedrich,
Manuel Brack,
Stephan Wäldchen,
Björn Deiseroth,
Patrick Schramowski,
Kristian Kersting
Abstract:
There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.
Submitted 24 June, 2025;
originally announced June 2025.
-
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Authors:
Thomas F Burns,
Letitia Parcalabescu,
Stephan Wäldchen,
Michael Barlow,
Gregor Ziegltrum,
Volker Stampa,
Bastian Harren,
Björn Deiseroth
Abstract:
Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.
Submitted 23 May, 2025; v1 submitted 24 April, 2025;
originally announced May 2025.
-
Hardness of Deceptive Certificate Selection
Authors:
Stephan Wäldchen
Abstract:
Recent progress towards theoretical interpretability guarantees for AI has been made with classifiers that are based on interactive proof systems. A prover selects a certificate from the datapoint and sends it to a verifier who decides the class. In the context of machine learning, such a certificate can be a feature that is informative of the class. For a setup with high soundness and completeness, the exchanged certificates must have a high mutual information with the true class of the datapoint. However, this guarantee relies on a bound on the Asymmetric Feature Correlation of the dataset, a property that so far is difficult to estimate for high-dimensional data. It was conjectured in Wäldchen et al. that it is computationally hard to exploit the AFC, which is what we prove here.
We consider a malicious prover-verifier duo that aims to exploit the AFC to achieve high completeness and soundness while using uninformative certificates. We show that this task is $\mathsf{NP}$-hard and cannot be approximated better than $\mathcal{O}(m^{1/8 - ε})$, where $m$ is the number of possible certificates, for $ε>0$ under the Dense-vs-Random conjecture. This is some evidence that AFC should not prevent the use of interactive classification for real-world tasks, as it is computationally hard to be exploited.
Submitted 7 June, 2023;
originally announced June 2023.
-
Interpretability Guarantees with Merlin-Arthur Classifiers
Authors:
Stephan Wäldchen,
Kartikey Sharma,
Berkant Turan,
Max Zimmer,
Sebastian Pokutta
Abstract:
We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.
Submitted 22 March, 2024; v1 submitted 1 June, 2022;
originally announced June 2022.
-
Training Characteristic Functions with Reinforcement Learning: XAI-methods play Connect Four
Authors:
Stephan Wäldchen,
Felix Huber,
Sebastian Pokutta
Abstract:
One of the goals of Explainable AI (XAI) is to determine which input components were relevant for a classifier decision. This is commonly known as saliency attribution. Characteristic functions (from cooperative game theory) are able to evaluate partial inputs and form the basis for theoretically "fair" attribution methods like Shapley values. Given only a standard classifier function, it is unclear how partial input should be realised. Instead, most XAI-methods for black-box classifiers like neural networks consider counterfactual inputs that generally lie off-manifold. This makes them hard to evaluate and easy to manipulate.
We propose a setup to directly train characteristic functions in the form of neural networks to play simple two-player games. We apply this to the game of Connect Four by randomly hiding colour information from our agents during training. This has three advantages for comparing XAI-methods: It alleviates the ambiguity about how to realise partial input, makes off-manifold evaluation unnecessary and allows us to compare the methods by letting them play against each other.
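As a rough illustration of how a characteristic function gives rise to Shapley values, the following sketch computes exact Shapley values by enumerating coalitions. It is not taken from the paper: the function name `shapley_values` and the toy characteristic function `v` are hypothetical, and the enumeration is only feasible for a handful of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(v, n):
    """Exact Shapley values for players 0..n-1.

    `v` is a characteristic function mapping a frozenset of players to a
    number (hypothetical interface, chosen for this sketch).
    """
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that player i joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of player i to coalition s.
                total += weight * (v(s | {i}) - v(s))
        values.append(total)
    return values


# Toy characteristic function (an assumption for illustration): the
# coalition "wins" only if it contains both player 0 and player 1.
def v(coalition):
    return 1.0 if {0, 1} <= coalition else 0.0


phi = shapley_values(v, 3)
```

In this toy game players 0 and 1 share the payoff equally and player 2, who never changes the outcome, receives nothing.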
Submitted 25 February, 2022; v1 submitted 23 February, 2022;
originally announced February 2022.
-
A Complete Characterisation of ReLU-Invariant Distributions
Authors:
Jan Macdonald,
Stephan Wäldchen
Abstract:
We give a complete characterisation of families of probability distributions that are invariant under the action of ReLU neural network layers. The need for such families arises during the training of Bayesian networks or the analysis of trained neural networks, e.g., in the context of uncertainty quantification (UQ) or explainable artificial intelligence (XAI). We prove that no invariant parametrised family of distributions can exist unless at least one of the following three restrictions holds: First, the network layers have a width of one, which is unreasonable for practical neural networks. Second, the probability measures in the family have finite support, which basically amounts to sampling distributions. Third, the parametrisation of the family is not locally Lipschitz continuous, which excludes all computationally feasible families. Finally, we show that these restrictions are individually necessary. For each of the three cases we can construct an invariant family exploiting exactly one of the restrictions but not the other two.
Submitted 13 December, 2021;
originally announced December 2021.
-
A Rate-Distortion Framework for Explaining Neural Network Decisions
Authors:
Jan Macdonald,
Stephan Wäldchen,
Sascha Hauch,
Gitta Kutyniok
Abstract:
We formalise the widespread idea of interpreting neural network decisions as an explicit optimisation problem in a rate-distortion framework. A set of input features is deemed relevant for a classification decision if the expected classifier score remains nearly constant when randomising the remaining features. We discuss the computational complexity of finding small sets of relevant features and show that the problem is complete for $\mathsf{NP}^\mathsf{PP}$, an important class of computational problems frequently arising in AI tasks. Furthermore, we show that it even remains $\mathsf{NP}$-hard to only approximate the optimal solution to within any non-trivial approximation factor. Finally, we consider a continuous problem relaxation and develop a heuristic solution strategy based on assumed density filtering for deep ReLU neural networks. We present numerical experiments for two image classification data sets where we outperform established methods in particular for sparse explanations of neural network decisions.
Submitted 27 May, 2019;
originally announced May 2019.
-
The Computational Complexity of Understanding Network Decisions
Authors:
Stephan Wäldchen,
Jan Macdonald,
Sascha Hauch,
Gitta Kutyniok
Abstract:
For a Boolean function $Φ\colon\{0,1\}^d\to\{0,1\}$ and an assignment to its variables $\mathbf{x}=(x_1, x_2, \dots, x_d)$ we consider the problem of finding the subsets of the variables that are sufficient to determine the function value with a given probability $δ$. This is motivated by the task of interpreting predictions of binary classifiers described as Boolean circuits (which can be seen as special cases of neural networks). We show that the problem of deciding whether such subsets of relevant variables of limited size $k\leq d$ exist is complete for the complexity class $\mathsf{NP}^{\mathsf{PP}}$ and thus generally infeasible to solve. We introduce a variant where it suffices to check whether a subset determines the function value with probability at least $δ$ or at most $δ-γ$ for $0<γ<δ$. This reduces the complexity to the class $\mathsf{NP}^{\mathsf{BPP}}$. Finally, we show that finding the minimal set of relevant variables cannot be reasonably approximated, i.e. with an approximation factor $d^{1-α}$ for $α> 0$, by a polynomial time algorithm unless $\mathsf{P} = \mathsf{NP}$ (this holds even with the probability gap).
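To make the notion of a subset that determines the function value with probability $δ$ concrete, here is a minimal brute-force sketch. It assumes the variables outside the subset are resampled uniformly and independently from $\{0,1\}$, which the abstract does not state explicitly; the function names and the example function are hypothetical. The exponential enumeration is consistent with the hardness results above and is only usable for very small $d$.

```python
from itertools import combinations, product


def relevance_probability(phi, x, subset):
    """Fraction of completions that preserve phi(x).

    Variables in `subset` keep their value from x; all others are
    enumerated over {0, 1} (uniform resampling is an assumption here).
    """
    free = [i for i in range(len(x)) if i not in subset]
    target = phi(x)
    agree = 0
    for bits in product((0, 1), repeat=len(free)):
        y = list(x)
        for i, b in zip(free, bits):
            y[i] = b
        agree += phi(tuple(y)) == target
    return agree / 2 ** len(free)


def smallest_relevant_subset(phi, x, delta):
    """Smallest index set whose relevance probability is at least delta."""
    d = len(x)
    for k in range(d + 1):
        for subset in combinations(range(d), k):
            if relevance_probability(phi, x, set(subset)) >= delta:
                return subset
    return tuple(range(d))


# Hypothetical example: phi(v) = (v0 AND v1) OR v2, evaluated at x = (1, 1, 0).
def phi(v):
    return int((v[0] and v[1]) or v[2])


x = (1, 1, 0)
p_empty = relevance_probability(phi, x, set())
p_first = relevance_probability(phi, x, {0})
best_relaxed = smallest_relevant_subset(phi, x, 0.7)
best_exact = smallest_relevant_subset(phi, x, 1.0)
```

Lowering $δ$ from 1 to 0.7 in this example shrinks the smallest sufficient subset from two variables to one.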
Submitted 18 June, 2019; v1 submitted 22 May, 2019;
originally announced May 2019.
-
Unmasking Clever Hans Predictors and Assessing What Machines Really Learn
Authors:
Sebastian Lapuschkin,
Stephan Wäldchen,
Alexander Binder,
Grégoire Montavon,
Wojciech Samek,
Klaus-Robert Müller
Abstract:
Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly "intelligent" behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted, to well-informed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem that it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner.
Submitted 26 February, 2019;
originally announced February 2019.