Showing 1–4 of 4 results for author: Juang, C

  1. arXiv:2504.02922  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

    Authors: Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda

    Abstract: Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine…

    Submitted 30 May, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

    Comments: 42 pages, 31 figures

  2. arXiv:2410.13928  [pdf, other]

    cs.LG cs.CL

    Automatically Interpreting Millions of Features in Large Language Models

    Authors: Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose

    Abstract: While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each on…

    Submitted 4 December, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

  3. arXiv:2407.14561  [pdf, other]

    cs.LG cs.AI

    NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

    Authors: Jaden Fiotto-Kaufman, Alexander R. Loftus, Eric Todd, Jannik Brinkmann, Koyena Pal, Dmitrii Troitskii, Michael Ripa, Adam Belfki, Can Rager, Caden Juang, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Nikhil Prakash, Carla Brodley, Arjun Guha, Jonathan Bell, Byron C. Wallace, David Bau

    Abstract: We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. The National Deep Inference Fabric (NDIF) is a scalable inference service that executes NNsight requests, allowing users to share GPU re…

    Submitted 1 April, 2025; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Code at https://nnsight.net

  4. arXiv:2405.05466  [pdf, other]

    cs.CL cs.AI

    Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

    Authors: Joshua Clymer, Caden Juang, Severin Field

    Abstract: Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while evaluated and misbehave when they have a good opportunity. Can current interpretability methods catch these 'alignment fakers?' To answer this question, we introduce a benchmark that consists of 324 pairs of LLMs fine-tuned to select actions in role-play scenarios. One model in each pair is consiste…

    Submitted 11 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.