-
Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders
Authors:
Ananya Joshi,
Celia Cintas,
Skyler Speakman
Abstract:
Recent work shows that Sparse Autoencoders (SAE) applied to large language model (LLM) layers have neurons corresponding to interpretable concepts. These SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the observational and modification properties of SAEs to enable alignment for any topic. This me…
▽ More
Recent work shows that Sparse Autoencoders (SAE) applied to large language model (LLM) layers have neurons corresponding to interpretable concepts. These SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the observational and modification properties of SAEs to enable alignment for any topic. This method 1) scores each SAE neuron by its semantic similarity to an alignment text and uses them to 2) modify SAE-layer-level outputs by emphasizing topic-aligned neurons. We assess the alignment capabilities of this approach on diverse public topic datasets including Amazon reviews, Medicine, and Sycophancy, across the currently available open-source LLMs and SAE pairs (GPT2 and Gemma) with multiple SAEs configurations. Experiments aligning to medical prompts reveal several benefits over fine-tuning, including increased average language acceptability (0.25 vs. 0.5), reduced training time across multiple alignment topics (333.6s vs. 62s), and acceptable inference time for many applications (+0.00092s/token). Our open-source code is available at github.com/IBM/sae-steering.
△ Less
Submitted 14 June, 2025;
originally announced June 2025.
-
Localizing Persona Representations in LLMs
Authors:
Celia Cintas,
Miriam Rateike,
Erik Miehling,
Elizabeth Daly,
Skyler Speakman
Abstract:
We present a study on how and where personas -- defined by distinct sets of human characteristics, values, and beliefs -- are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations…
▽ More
We present a study on how and where personas -- defined by distinct sets of human characteristics, values, and beliefs -- are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations within a selected layer to examine how specific personas are encoded relative to others, including their shared and distinct embedding spaces. We find that, across multiple pre-trained decoder-only LLMs, the analyzed personas show large differences in representation space only within the final third of the decoder layers. We observe overlapping activations for specific ethical perspectives -- such as moral nihilism and utilitarianism -- suggesting a degree of polysemy. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions. These findings help to improve our understanding of how LLMs internally represent information and can inform future efforts in refining the modulation of specific human traits in LLM outputs. Warning: This paper includes potentially offensive sample statements.
△ Less
Submitted 3 June, 2025; v1 submitted 30 May, 2025;
originally announced May 2025.
-
Percolation pathway switching in laser graphitized polyimide conducting tracks
Authors:
Melanie Whitfield,
Larry Yip,
Stuart Speakman,
David Hasko
Abstract:
Laser processing has been used to create weakly conducting tracks in polyimide film. Raman spectroscopy shows that these tracks consist of nanometre sized graphitic regions contained in a carbon-rich matrix. The measured temperature dependent and electric field dependent conduction characteristics show an activated characteristic that is consistent with nearest neighbour hopping. In addition, disc…
▽ More
Laser processing has been used to create weakly conducting tracks in polyimide film. Raman spectroscopy shows that these tracks consist of nanometre sized graphitic regions contained in a carbon-rich matrix. The measured temperature dependent and electric field dependent conduction characteristics show an activated characteristic that is consistent with nearest neighbour hopping. In addition, discrete percolation pathway switching events are seen when the system is subjected to significant disturbance from equilibrium.
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
Efficient Representation of the Activation Space in Deep Neural Networks
Authors:
Tanya Akumu,
Celia Cintas,
Girmaw Abebe Tadesse,
Adebayo Oshingbesan,
Skyler Speakman,
Edward McFowland III
Abstract:
The representations of the activation space of deep neural networks (DNNs) are widely utilized for tasks like natural language processing, anomaly detection and speech recognition. Due to the diverse nature of these tasks and the large size of DNNs, an efficient and task-independent representation of activations becomes crucial. Empirical p-values have been used to quantify the relative strength o…
▽ More
The representations of the activation space of deep neural networks (DNNs) are widely utilized for tasks like natural language processing, anomaly detection and speech recognition. Due to the diverse nature of these tasks and the large size of DNNs, an efficient and task-independent representation of activations becomes crucial. Empirical p-values have been used to quantify the relative strength of an observed node activation compared to activations created by already-known inputs. Nonetheless, keeping raw data for these calculations increases memory resource consumption and raises privacy concerns. To this end, we propose a model-agnostic framework for creating representations of activations in DNNs using node-specific histograms to compute p-values of observed activations without retaining already-known inputs. Our proposed approach demonstrates promising potential when validated with multiple network architectures across various downstream tasks and compared with the kernel density estimates and brute-force empirical baselines. In addition, the framework reduces memory usage by 30% with up to 4 times faster p-value computing time while maintaining state of-the-art detection power in downstream tasks such as the detection of adversarial attacks and synthesized content. Moreover, as we do not persist raw data at inference time, we could potentially reduce susceptibility to attacks and privacy issues.
△ Less
Submitted 13 December, 2023;
originally announced December 2023.
-
Weakly Supervised Detection of Hallucinations in LLM Activations
Authors:
Miriam Rateike,
Celia Cintas,
John Wamburu,
Tanya Akumu,
Skyler Speakman
Abstract:
We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the typ…
▽ More
We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns a-priori. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Importantly, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
△ Less
Submitted 5 December, 2023;
originally announced December 2023.
-
Model-free feature selection to facilitate automatic discovery of divergent subgroups in tabular data
Authors:
Girmaw Abebe Tadesse,
William Ogallo,
Celia Cintas,
Skyler Speakman
Abstract:
Data-centric AI encourages the need of cleaning and understanding of data in order to achieve trustworthy AI. Existing technologies, such as AutoML, make it easier to design and train models automatically, but there is a lack of a similar level of capabilities to extract data-centric insights. Manual stratification of tabular data per a feature (e.g., gender) is limited to scale up for higher feat…
▽ More
Data-centric AI encourages the need of cleaning and understanding of data in order to achieve trustworthy AI. Existing technologies, such as AutoML, make it easier to design and train models automatically, but there is a lack of a similar level of capabilities to extract data-centric insights. Manual stratification of tabular data per a feature (e.g., gender) is limited to scale up for higher feature dimension, which could be addressed using automatic discovery of divergent subgroups. Nonetheless, these automatic discovery techniques often search across potentially exponential combinations of features that could be simplified using a preceding feature selection step. Existing feature selection techniques for tabular data often involve fitting a particular model in order to select important features. However, such model-based selection is prone to model-bias and spurious correlations in addition to requiring extra resource to design, fine-tune and train a model. In this paper, we propose a model-free and sparsity-based automatic feature selection (SAFS) framework to facilitate automatic discovery of divergent subgroups. Different from filter-based selection techniques, we exploit the sparsity of objective measures among feature values to rank and select features. We validated SAFS across two publicly available datasets (MIMIC-III and Allstate Claims) and compared it with six existing feature selection methods. SAFS achieves a reduction of feature selection time by a factor of 81x and 104x, averaged cross the existing methods in the MIMIC-III and Claims datasets respectively. SAFS-selected features are also shown to achieve competitive detection performance, e.g., 18.3% of features selected by SAFS in the Claims dataset detected divergent samples similar to those detected by using the whole features with a Jaccard similarity of 0.95 but with a 16x reduction in detection time.
△ Less
Submitted 8 March, 2022;
originally announced March 2022.
-
Towards Creativity Characterization of Generative Models via Group-based Subset Scanning
Authors:
Celia Cintas,
Payel Das,
Brian Quanz,
Girmaw Abebe Tadesse,
Skyler Speakman,
Pin-Yu Chen
Abstract:
Deep generative models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), have been employed widely in computational creativity research. However, such models discourage out-of-distribution generation to avoid spurious sample generation, thereby limiting their creativity. Thus, incorporating research on human creativity into generative deep learning techniques pre…
▽ More
Deep generative models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), have been employed widely in computational creativity research. However, such models discourage out-of-distribution generation to avoid spurious sample generation, thereby limiting their creativity. Thus, incorporating research on human creativity into generative deep learning techniques presents an opportunity to make their outputs more compelling and human-like. As we see the emergence of generative models directed toward creativity research, a need for machine learning-based surrogate metrics to characterize creative output from these models is imperative. We propose group-based subset scanning to identify, quantify, and characterize creative processes by detecting a subset of anomalous node-activations in the hidden layers of the generative models. Our experiments on the standard image benchmarks, and their "creatively generated" variants, reveal that the proposed subset scores distribution is more useful for detecting creative processes in the activation space rather than the pixel space. Further, we found that creative samples generate larger subsets of anomalies than normal or non-creative samples across datasets. The node activations highlighted during the creative decoding process are different from those responsible for the normal sample generation. Lastly, we assess if the images from the subsets selected by our method were also found creative by human evaluators, presenting a link between creativity perception in humans and node activations within deep neural nets.
△ Less
Submitted 26 May, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
Sparsity-based Feature Selection for Anomalous Subgroup Discovery
Authors:
Girmaw Abebe Tadesse,
William Ogallo,
Catherine Wanjiru,
Charles Wachira,
Isaiah Onando Mulang',
Vibha Anand,
Aisha Walcott-Bryant,
Skyler Speakman
Abstract:
Anomalous pattern detection aims to identify instances where deviation from normalcy is evident, and is widely applicable across domains. Multiple anomalous detection techniques have been proposed in the state of the art. However, there is a common lack of a principled and scalable feature selection method for efficient discovery. Existing feature selection techniques are often conducted by optimi…
▽ More
Anomalous pattern detection aims to identify instances where deviation from normalcy is evident, and is widely applicable across domains. Multiple anomalous detection techniques have been proposed in the state of the art. However, there is a common lack of a principled and scalable feature selection method for efficient discovery. Existing feature selection techniques are often conducted by optimizing the performance of prediction outcomes rather than its systemic deviations from the expected. In this paper, we proposed a sparsity-based automated feature selection (SAFS) framework, which encodes systemic outcome deviations via the sparsity of feature-driven odds ratios. SAFS is a model-agnostic approach with usability across different discovery techniques. SAFS achieves more than $3\times$ reduction in computation time while maintaining detection performance when validated on publicly available critical care dataset. SAFS also results in a superior performance when compared against multiple baselines for feature selection.
△ Less
Submitted 6 January, 2022;
originally announced January 2022.
-
Pattern Detection in the Activation Space for Identifying Synthesized Content
Authors:
Celia Cintas,
Skyler Speakman,
Girmaw Abebe Tadesse,
Victor Akinwande,
Edward McFowland III,
Komminist Weldemariam
Abstract:
Generative Adversarial Networks (GANs) have recently achieved unprecedented success in photo-realistic image synthesis from low-dimensional random noise. The ability to synthesize high-quality content at a large scale brings potential risks as the generated samples may lead to misinformation that can create severe social, political, health, and business hazards. We propose SubsetGAN to identify ge…
▽ More
Generative Adversarial Networks (GANs) have recently achieved unprecedented success in photo-realistic image synthesis from low-dimensional random noise. The ability to synthesize high-quality content at a large scale brings potential risks as the generated samples may lead to misinformation that can create severe social, political, health, and business hazards. We propose SubsetGAN to identify generated content by detecting a subset of anomalous node-activations in the inner layers of pre-trained neural networks. These nodes, as a group, maximize a non-parametric measure of divergence away from the expected distribution of activations created from real data. This enable us to identify synthesised images without prior knowledge of their distribution. SubsetGAN efficiently scores subsets of nodes and returns the group of nodes within the pre-trained classifier that contributed to the maximum score. The classifier can be a general fake classifier trained over samples from multiple sources or the discriminator network from different GANs. Our approach shows consistently higher detection power than existing detection methods across several state-of-the-art GANs (PGGAN, StarGAN, and CycleGAN) and over different proportions of generated content.
△ Less
Submitted 27 May, 2021; v1 submitted 26 May, 2021;
originally announced May 2021.
-
Out-of-Distribution Detection in Dermatology using Input Perturbation and Subset Scanning
Authors:
Hannah Kim,
Girmaw Abebe Tadesse,
Celia Cintas,
Skyler Speakman,
Kush Varshney
Abstract:
Recent advances in deep learning have led to breakthroughs in the development of automated skin disease classification. As we observe an increasing interest in these models in the dermatology space, it is crucial to address aspects such as the robustness towards input data distribution shifts. Current skin disease models could make incorrect inferences for test samples from different hardware devi…
▽ More
Recent advances in deep learning have led to breakthroughs in the development of automated skin disease classification. As we observe an increasing interest in these models in the dermatology space, it is crucial to address aspects such as the robustness towards input data distribution shifts. Current skin disease models could make incorrect inferences for test samples from different hardware devices and clinical settings or unknown disease samples, which are out-of-distribution (OOD) from the training samples. To this end, we propose a simple yet effective approach that detect these OOD samples prior to making any decision. The detection is performed via scanning in the latent space representation (e.g., activations of the inner layers of any pre-trained skin disease classifier). The input samples could also perturbed to maximise divergence of OOD samples. We validate our ODD detection approach in two use cases: 1) identify samples collected from different protocols, and 2) detect samples from unknown disease classes. Additionally, we evaluate the performance of the proposed approach and compare it with other state-of-the-art methods. Furthermore, data-driven dermatology applications may deepen the disparity in clinical care across racial and ethnic groups since most datasets are reported to suffer from bias in skin tone distribution. Therefore, we also evaluate the fairness of these OOD detection methods across different skin tones. Our experiments resulted in competitive performance across multiple datasets in detecting OOD samples, which could be used (in the future) to design more effective transfer learning techniques prior to inferring on these samples.
△ Less
Submitted 2 June, 2021; v1 submitted 24 May, 2021;
originally announced May 2021.
-
Towards creativity characterization of generative models via group-based subset scanning
Authors:
Celia Cintas,
Payel Das,
Brian Quanz,
Skyler Speakman,
Victor Akinwande,
Pin-Yu Chen
Abstract:
Deep generative models, such as Variational Autoencoders (VAEs), have been employed widely in computational creativity research. However, such models discourage out-of-distribution generation to avoid spurious sample generation, limiting their creativity. Thus, incorporating research on human creativity into generative deep learning techniques presents an opportunity to make their outputs more com…
▽ More
Deep generative models, such as Variational Autoencoders (VAEs), have been employed widely in computational creativity research. However, such models discourage out-of-distribution generation to avoid spurious sample generation, limiting their creativity. Thus, incorporating research on human creativity into generative deep learning techniques presents an opportunity to make their outputs more compelling and human-like. As we see the emergence of generative models directed to creativity research, a need for machine learning-based surrogate metrics to characterize creative output from these models is imperative. We propose group-based subset scanning to quantify, detect, and characterize creative processes by detecting a subset of anomalous node-activations in the hidden layers of generative models. Our experiments on original, typically decoded, and "creatively decoded" (Das et al 2020) image datasets reveal that the proposed subset scores distribution is more useful for detecting creative processes in the activation space rather than the pixel space. Further, we found that creative samples generate larger subsets of anomalies than normal or non-creative samples across datasets. The node activations highlighted during the creative decoding process are different from those responsible for normal sample generation.
△ Less
Submitted 26 May, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
Prediction of neonatal mortality in Sub-Saharan African countries using data-level linkage of multiple surveys
Authors:
Girmaw Abebe Tadesse,
Celia Cintas,
Skyler Speakman,
Komminist Weldemariam
Abstract:
Existing datasets available to address crucial problems, such as child mortality and family planning discontinuation in developing countries, are not ample for data-driven approaches. This is partly due to disjoint data collection efforts employed across locations, times, and variations of modalities. On the other hand, state-of-the-art methods for small data problem are confined to image modaliti…
▽ More
Existing datasets available to address crucial problems, such as child mortality and family planning discontinuation in developing countries, are not ample for data-driven approaches. This is partly due to disjoint data collection efforts employed across locations, times, and variations of modalities. On the other hand, state-of-the-art methods for small data problem are confined to image modalities. In this work, we proposed a data-level linkage of disjoint surveys across Sub-Saharan African countries to improve prediction performance of neonatal death and provide cross-domain explainability.
△ Less
Submitted 25 November, 2020;
originally announced November 2020.
-
Identifying Audio Adversarial Examples via Anomalous Pattern Detection
Authors:
Victor Akinwande,
Celia Cintas,
Skyler Speakman,
Srihari Sridharan
Abstract:
Audio processing models based on deep neural networks are susceptible to adversarial attacks even when the adversarial audio waveform is 99.9% similar to a benign sample. Given the wide application of DNN-based audio recognition systems, detecting the presence of adversarial examples is of high practical relevance. By applying anomalous pattern detection techniques in the activation space of these…
▽ More
Audio processing models based on deep neural networks are susceptible to adversarial attacks even when the adversarial audio waveform is 99.9% similar to a benign sample. Given the wide application of DNN-based audio recognition systems, detecting the presence of adversarial examples is of high practical relevance. By applying anomalous pattern detection techniques in the activation space of these models, we show that 2 of the recent and current state-of-the-art adversarial attacks on audio processing systems systematically lead to higher-than-expected activation at some subset of nodes and we can detect these with up to an AUC of 0.98 with no degradation in performance on benign samples.
△ Less
Submitted 25 July, 2020; v1 submitted 13 February, 2020;
originally announced February 2020.
-
Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models
Authors:
Daniel Omeiza,
Skyler Speakman,
Celia Cintas,
Komminist Weldermariam
Abstract:
Gaining insight into how deep convolutional neural network models perform image classification and how to explain their outputs have been a concern to computer vision researchers and decision makers. These deep models are often referred to as black box due to low comprehension of their internal workings. As an effort to developing explainable deep learning models, several methods have been propose…
▽ More
Gaining insight into how deep convolutional neural network models perform image classification and how to explain their outputs have been a concern to computer vision researchers and decision makers. These deep models are often referred to as black box due to low comprehension of their internal workings. As an effort to developing explainable deep learning models, several methods have been proposed such as finding gradients of class output with respect to input image (sensitivity maps), class activation map (CAM), and Gradient based Class Activation Maps (Grad-CAM). These methods under perform when localizing multiple occurrences of the same class and do not work for all CNNs. In addition, Grad-CAM does not capture the entire object in completeness when used on single object images, this affect performance on recognition tasks. With the intention to create an enhanced visual explanation in terms of visual sharpness, object localization and explaining multiple occurrences of objects in a single image, we present Smooth Grad-CAM++ \footnote{Simple demo: http://35.238.22.135:5000/}, a technique that combines methods from two other recent techniques---SMOOTHGRAD and Grad-CAM++. Our Smooth Grad-CAM++ technique provides the capability of either visualizing a layer, subset of feature maps, or subset of neurons within a feature map at each instance at the inference level (model prediction process). After experimenting with few images, Smooth Grad-CAM++ produced more visually sharp maps with better localization of objects in the given input images when compared with other methods.
△ Less
Submitted 3 August, 2019;
originally announced August 2019.
-
Subset Scanning Over Neural Network Activations
Authors:
Skyler Speakman,
Srihari Sridharan,
Sekou Remy,
Komminist Weldemariam,
Edward McFowland
Abstract:
This work views neural networks as data generating systems and applies anomalous pattern detection techniques on that data in order to detect when a network is processing an anomalous input. Detecting anomalies is a critical component for multiple machine learning problems including detecting adversarial noise. More broadly, this work is a step towards giving neural networks the ability to recogni…
▽ More
This work views neural networks as data generating systems and applies anomalous pattern detection techniques on that data in order to detect when a network is processing an anomalous input. Detecting anomalies is a critical component for multiple machine learning problems including detecting adversarial noise. More broadly, this work is a step towards giving neural networks the ability to recognize an out-of-distribution sample. This is the first work to introduce "Subset Scanning" methods from the anomalous pattern detection domain to the task of detecting anomalous input of neural networks. Subset scanning treats the detection problem as a search for the most anomalous subset of node activations (i.e., highest scoring subset according to non-parametric scan statistics). Mathematical properties of these scoring functions allow the search to be completed in log-linear rather than exponential time while still guaranteeing the most anomalous subset of nodes in the network is identified for a given input. Quantitative results for detecting and characterizing adversarial noise are provided for CIFAR-10 images on a simple convolutional neural network. We observe an "interference" pattern where anomalous activations in shallow layers suppress the activation structure of the original image in deeper layers.
△ Less
Submitted 19 October, 2018;
originally announced October 2018.
-
A Novel Test for Host-Symbiont Codivergence Indicates Ancient Origin of Fungal Endophytes in Grasses
Authors:
Chris L. Schardl,
Kelly D. Craven,
Adam Lindstrom,
Skyler Speakman,
Arnold Stromberg,
Ruriko Yoshida
Abstract:
Significant phylogenetic codivergence between plant or animal hosts ($H$) and their symbionts or parasites ($P$) indicate the importance of their interactions on evolutionary time scales. However, valid and realistic methods to test for codivergence are not fully developed. One of the systems where possible codivergence has been of interest involves the large subfamily of temperate grasses (Pooi…
▽ More
Significant phylogenetic codivergence between plant or animal hosts ($H$) and their symbionts or parasites ($P$) indicate the importance of their interactions on evolutionary time scales. However, valid and realistic methods to test for codivergence are not fully developed. One of the systems where possible codivergence has been of interest involves the large subfamily of temperate grasses (Pooideae) and their endophytic fungi (epichloae). These widespread symbioses often help protect host plants from herbivory and stresses, and affect species diversity and food web structures. Here we introduce the MRCALink (most-recent-common-ancestor link) method and use it to investigate the possibility of grass-epichloƫ codivergence. MRCALink applied to ultrametric $H$ and $P$ trees identifies all corresponding nodes for pairwise comparisons of MRCA ages. The result is compared to the space of random $H$ and $P$ tree pairs estimated by a Monte Carlo method. Compared to tree reconciliation the method is less dependent on tree topologies (which often can be misleading), and it crucially improves on phylogeny-independent methods such as {\tt ParaFit} or the Mantel test by eliminating an extreme (but previously unrecognized) distortion of node-pair sampling. Analysis of 26 grass species-epichloƫ species symbioses did not reject random association of $H$ and $P$ MRCA ages. However, when five obvious host jumps were removed the analysis significantly rejected random association and supported grass-endophyte codivergence. Interestingly, early cladogenesis events in the Pooideae corresponded to early cladogenesis events in epichloae, suggesting concomitant origins of this grass subfamily and its remarkable group of symbionts. We also applied our method to the well-known gopher-louse data set.
△ Less
Submitted 28 August, 2008; v1 submitted 25 November, 2006;
originally announced November 2006.