
Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability

Authors: Vishnu Kabir Chhabra, Mohammad Mahdi Khalili

Abstract: The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. While extensive research has explored model compression through the lens of safety, findings suggest that safety-aligned models often lose elements of trustworthiness post-compression. Simultaneously, the field of mechanistic interpretability has gained traction, with notable discoveries, such as the identification of a single direction in the residual stream mediating refusal behaviors across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting a novel interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.

Submitted 5 April, 2025; originally announced April 2025.

arXiv:2503.01896

Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification

Authors: Vishnu Kabir Chhabra, Ding Zhu, Mohammad Mahdi Khalili

Abstract: Previous research has shown that fine-tuning language models on general tasks enhances their underlying mechanisms. However, the impact of fine-tuning on poisoned data and the resulting changes in these mechanisms are poorly understood. This study investigates the changes in a model's mechanisms during toxic fine-tuning and identifies the primary corruption mechanisms. We also analyze the changes after retraining a corrupted model on the original dataset and observe neuroplasticity behaviors, where the model relearns its original mechanisms. Our findings indicate that: (i) underlying mechanisms are amplified across task-specific fine-tuning, which can be generalized to longer epochs; (ii) model corruption via toxic fine-tuning is localized to specific circuit components; (iii) models exhibit neuroplasticity when corrupted models are retrained on a clean dataset, reforming the original model mechanisms.

Submitted 27 February, 2025; originally announced March 2025.

arXiv:2407.11065

ECG Signal Denoising Using Multi-scale Patch Embedding and Transformers

Authors: Ding Zhu, Vishnu Kabir Chhabra, Mohammad Mahdi Khalili

Abstract: Cardiovascular disease is a major life-threatening condition that is commonly monitored using electrocardiogram (ECG) signals. However, these signals are often contaminated by various types of noise at different intensities, significantly interfering with downstream tasks. Therefore, denoising ECG signals and increasing the signal-to-noise ratio is crucial for cardiovascular monitoring. In this paper, we propose a deep learning method that combines a one-dimensional convolutional layer with a transformer architecture for denoising ECG signals. The convolutional layer processes the ECG signal with various kernel/patch sizes and generates an embedding called multi-scale patch embedding. The embedding is then used as the input to a transformer network and enhances the capability of the transformer for denoising the ECG signal.

Submitted 11 July, 2024; originally announced July 2024.

Showing 1–3 of 3 results for author: Chhabra, V K