Showing 1–4 of 4 results for author: Venhoff, C

Searching in archive cs.
  1. arXiv:2506.18167

    cs.LG cs.AI

    Understanding Reasoning in Thinking Language Models via Steering Vectors

    Authors: Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda

    Abstract: Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning beha…

    Submitted 23 June, 2025; v1 submitted 22 June, 2025; originally announced June 2025.

  2. arXiv:2506.11976

    cs.CV cs.LG

    How Visual Representations Map to Language Feature Space in Multimodal LLMs

    Authors: Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda

    Abstract: Effective multimodal reasoning depends on the alignment of visual and linguistic representations, yet the mechanisms by which vision-language models (VLMs) achieve this alignment remain poorly understood. Following the LiMBeR framework, we deliberately maintain a frozen large language model (LLM) and a frozen vision transformer (ViT), connected solely by training a linear adapter during visual ins…

    Submitted 21 June, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

  3. arXiv:2503.07639

    cs.LG cs.CL

    Mixture of Experts Made Intrinsically Interpretable

    Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr

    Abstract: Neurons in large language models often exhibit polysemanticity, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present MoE-X, a Mixture-of-Experts (MoE) language model designed to be intrinsically interpretable. Our approach is motivated by the observation that, in language models, wider networks…

    Submitted 5 March, 2025; originally announced March 2025.

  4. arXiv:2410.07456

    cs.LG

    SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders

    Authors: Constantin Venhoff, Anisoara Calinescu, Philip Torr, Christian Schroeder de Witt

    Abstract: A key challenge in interpretability is to decompose model activations into meaningful features. Sparse autoencoders (SAEs) have emerged as a promising tool for this task. However, a central problem in evaluating the quality of SAEs is the absence of ground truth features to serve as an evaluation gold standard. Current evaluation methods for SAEs are therefore confronted with a significant trade-o…

    Submitted 9 October, 2024; originally announced October 2024.