Search | arXiv e-print repository

Distillation Robustifies Unlearning

Authors: Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner

Abstract: Current LLM unlearning methods are not robust: they can be reverted easily with a few steps of finetuning. This is true even for the idealized unlearning method of training to imitate an oracle model that was never exposed to unwanted information, suggesting that output-based finetuning is insufficient to achieve robust unlearning. In a similar vein, we find that training a randomly initialized st… ▽ More Current LLM unlearning methods are not robust: they can be reverted easily with a few steps of finetuning. This is true even for the idealized unlearning method of training to imitate an oracle model that was never exposed to unwanted information, suggesting that output-based finetuning is insufficient to achieve robust unlearning. In a similar vein, we find that training a randomly initialized student to imitate an unlearned model transfers desired behaviors while leaving undesired capabilities behind. In other words, distillation robustifies unlearning. Building on this insight, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a partially noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal. △ Less

Submitted 9 June, 2025; v1 submitted 6 June, 2025; originally announced June 2025.

arXiv:2411.08166 [pdf, other]

Tackling Polysemanticity with Neuron Embeddings

Authors: Alex Foote

Abstract: We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representation… ▽ More We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs). △ Less

Submitted 12 November, 2024; originally announced November 2024.

arXiv:2405.10020 [pdf, other]

Natural Language Can Help Bridge the Sim2Real Gap

Authors: Albert Yu, Adeline Foote, Raymond Mooney, Roberto Martín-Martín

Abstract: The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Due to the high dimensionality of the image space, learning a good visual representation requires a considerable amount of visual data. However, when learning in the real world, data is expensive. Sim2Real is a promising paradigm for overcoming data scarcity in the… ▽ More The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Due to the high dimensionality of the image space, learning a good visual representation requires a considerable amount of visual data. However, when learning in the real world, data is expensive. Sim2Real is a promising paradigm for overcoming data scarcity in the real-world target domain by using a simulator to collect large amounts of cheap data closely related to the target task. However, it is difficult to transfer an image-conditioned policy from sim to real when the domains are very visually dissimilar. To bridge the sim2real visual gap, we propose using natural language descriptions of images as a unifying signal across domains that captures the underlying task-relevant semantics. Our key insight is that if two image observations from different domains are labeled with similar language, the policy should predict similar action distributions for both images. We demonstrate that training the image encoder to predict the language description or the distance between descriptions of a sim or real image serves as a useful, data-efficient pretraining step that helps learn a domain-invariant image representation. We can then use this image encoder as the backbone of an IL policy trained simultaneously on a large amount of simulated and a handful of real demonstrations. Our approach outperforms widely used prior sim2real methods and strong vision-language pretraining baselines like CLIP and R3M by 25 to 40%. See additional videos and materials at https://robin-lab.cs.utexas.edu/lang4sim2real/. △ Less

Submitted 2 July, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

Comments: To appear in RSS 2024. Project website at https://robin-lab.cs.utexas.edu/lang4sim2real/

ACM Class: I.2.9; I.2.7; I.2.6

arXiv:2305.19911 [pdf, other]

Neuron to Graph: Interpreting Language Model Neurons at Scale

Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

Abstract: Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make th… ▽ More Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safe. Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the most pertinent tokens to a neuron while enriching dataset examples with diverse samples to better encompass the full spectrum of neuron behaviour. These graphs can be visualised to aid researchers' manual interpretation, and can generate token activations on text for automatic validation by comparison with the neuron's ground truth activations, which we use to show that the model is better at predicting neuron activation than two baseline methods. We also demonstrate how the generated graph representations can be flexibly used to facilitate further automation of interpretability research, by searching for neurons with particular properties, or programmatically comparing neurons to each other to identify similar neurons. Our method easily scales to build graph representations for all neurons in a 6-layer Transformer model using a single Tesla T4 GPU, allowing for wide usability. We release the code and instructions for use at https://github.com/alexjfoote/Neuron2Graph. △ Less

Submitted 31 May, 2023; originally announced May 2023.

arXiv:2304.12918 [pdf, other]

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

Authors: Alex Foote, Neel Nanda, Esben Kran, Ionnis Konstas, Fazl Barez

Abstract: Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than cu… ▽ More Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than current manual methods, that will better scale these methods to Large Language Models (LLMs). We use truncation and saliency methods to only present the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour. These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground truth activations for automatic validation. N2G represents a step towards scalable interpretability methods by allowing us to convert neurons in an LLM to interpretable representations of measurable quality. △ Less

Submitted 22 April, 2023; originally announced April 2023.

Comments: To be published at ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models

arXiv:2201.12311 [pdf]

REET: Robustness Evaluation and Enhancement Toolbox for Computational Pathology

Authors: Alex Foote, Amina Asif, Nasir Rajpoot, Fayyaz Minhas

Abstract: Motivation: Digitization of pathology laboratories through digital slide scanners and advances in deep learning approaches for objective histological assessment have resulted in rapid progress in the field of computational pathology (CPath) with wide-ranging applications in medical and pharmaceutical research as well as clinical workflows. However, the estimation of robustness of CPath models to v… ▽ More Motivation: Digitization of pathology laboratories through digital slide scanners and advances in deep learning approaches for objective histological assessment have resulted in rapid progress in the field of computational pathology (CPath) with wide-ranging applications in medical and pharmaceutical research as well as clinical workflows. However, the estimation of robustness of CPath models to variations in input images is an open problem with a significant impact on the down-stream practical applicability, deployment and acceptability of these approaches. Furthermore, development of domain-specific strategies for enhancement of robustness of such models is of prime importance as well. Implementation and Availability: In this work, we propose the first domain-specific Robustness Evaluation and Enhancement Toolbox (REET) for computational pathology applications. It provides a suite of algorithmic strategies for enabling robustness assessment of predictive models with respect to specialized image transformations such as staining, compression, focusing, blurring, changes in spatial resolution, brightness variations, geometric changes as well as pixel-level adversarial perturbations. Furthermore, REET also enables efficient and robust training of deep learning pipelines in computational pathology. REET is implemented in Python and is available at the following URL: https://github.com/alexjfoote/reetoolbox. Contact: [email protected] △ Less

Submitted 28 January, 2022; originally announced January 2022.

arXiv:2106.08153 [pdf]

Now You See It, Now You Dont: Adversarial Vulnerabilities in Computational Pathology

Authors: Alex Foote, Amina Asif, Ayesha Azam, Tim Marshall-Cox, Nasir Rajpoot, Fayyaz Minhas

Abstract: Deep learning models are routinely employed in computational pathology (CPath) for solving problems of diagnostic and prognostic significance. Typically, the generalization performance of CPath models is analyzed using evaluation protocols such as cross-validation and testing on multi-centric cohorts. However, to ensure that such CPath solutions are robust and safe for use in a clinical setting, a… ▽ More Deep learning models are routinely employed in computational pathology (CPath) for solving problems of diagnostic and prognostic significance. Typically, the generalization performance of CPath models is analyzed using evaluation protocols such as cross-validation and testing on multi-centric cohorts. However, to ensure that such CPath solutions are robust and safe for use in a clinical setting, a critical analysis of their predictive performance and vulnerability to adversarial attacks is required, which is the focus of this paper. Specifically, we show that a highly accurate model for classification of tumour patches in pathology images (AUC > 0.95) can easily be attacked with minimal perturbations which are imperceptible to lay humans and trained pathologists alike. Our analytical results show that it is possible to generate single-instance white-box attacks on specific input images with high success rate and low perturbation energy. Furthermore, we have also generated a single universal perturbation matrix using the training dataset only which, when added to unseen test images, results in forcing the trained neural network to flip its prediction labels with high confidence at a success rate of > 84%. We systematically analyze the relationship between perturbation energy of an adversarial attack, its impact on morphological constructs of clinical significance, their perceptibility by a trained pathologist and saliency maps obtained using deep learning models. Based on our analysis, we strongly recommend that computational pathology models be critically analyzed using the proposed adversarial validation strategy prior to clinical adoption. △ Less

Submitted 16 June, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: 10 pages

arXiv:1702.00092 [pdf, ps, other]

doi 10.1112/plms.12143

The $2$-Selmer group of a number field and heuristics for narrow class groups and signature ranks of units

Authors: David S. Dummit, John Voight, appendix with Richard Foote

Abstract: We investigate in detail a homomorphism which we call the 2-Selmer signature map from the $2$-Selmer group of a number field $K$ to a nondegenerate symmetric space, in particular proving the image is a maximal totally isotropic subspace. Applications include precise predictions on the density of fields $K$ with given narrow class group 2-rank and with given unit group signature rank. In addition t… ▽ More We investigate in detail a homomorphism which we call the 2-Selmer signature map from the $2$-Selmer group of a number field $K$ to a nondegenerate symmetric space, in particular proving the image is a maximal totally isotropic subspace. Applications include precise predictions on the density of fields $K$ with given narrow class group 2-rank and with given unit group signature rank. In addition to theoretical evidence, extensive computations for totally real cubic and quintic fields are presented that match the predictions extremely well. In an appendix with Richard Foote, we classify the maximal totally isotropic subspaces of orthogonal direct sums of two nondegenerate symmetric spaces over perfect fields of characteristic 2 and derive some consequences, including a mass formula for such subspaces. △ Less

Submitted 29 March, 2018; v1 submitted 31 January, 2017; originally announced February 2017.

Comments: 48 pages, small corrections made

Showing 1–8 of 8 results for author: Foote, A