Showing 1–5 of 5 results for author: Ayonrinde, K

  1. arXiv:2505.01372  [pdf, ps, other]

    cs.LG cs.AI cs.CL cs.HC

    Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

    Authors: Kola Ayonrinde, Louis Jaburi

    Abstract: Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives fro…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: 13 pages (plus appendices), 5 figures

  2. arXiv:2505.00808  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

    Authors: Kola Ayonrinde, Louis Jaburi

    Abstract: Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an ex…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: 15 pages (plus appendices), 2 figures

  3. arXiv:2503.09532  [pdf, ps, other]

    cs.LG cs.CL

    SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

    Authors: Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda

    Abstract: Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across eight diverse metrics, spanning i…

    Submitted 4 June, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: Accepted to ICML 2025 main conference

  4. arXiv:2411.02124  [pdf, other]

    cs.LG cs.AI

    Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

    Authors: Kola Ayonrinde

    Abstract: Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a sparsifying activation function that implicitly defines a set of token-feature matches. We frame the token-feature matching as a resource allocation problem constrain…

    Submitted 7 November, 2024; v1 submitted 4 November, 2024; originally announced November 2024.

    Comments: 10 pages (18 w/ appendices), 7 figures. Preprint

  5. arXiv:2410.11179  [pdf, other]

    cs.LG cs.AI cs.IT

    Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

    Authors: Kola Ayonrinde, Michael T. Pearce, Lee Sharkey

    Abstract: Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neur…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 8 pages, 5 figures