Showing 1–3 of 3 results for author: Petrova, N

Searching in archive cs.

  1. arXiv:2504.18872 [pdf, other]

    cs.CL cs.LG

    Latent Adversarial Training Improves the Representation of Refusal

    Authors: Alexandra Abbas, Nora Petrova, Helios Ael Lyons, Natalia Perez-Campanero

    Abstract: Recent work has shown that language models' refusal behavior is primarily encoded in a single direction in their latent space, making it vulnerable to targeted attacks. Although Latent Adversarial Training (LAT) attempts to improve robustness by introducing noise during training, a key question remains: How does this noise-based training affect the underlying representation of refusal behavior? Un…

    Submitted 26 April, 2025; originally announced April 2025.

  2. arXiv:2409.17113 [pdf, other]

    cs.LG

    Characterizing stable regions in the residual stream of LLMs

    Authors: Jett Janiak, Jacek Karwowski, Chatrik Singh Mangat, Giorgi Giglemiani, Nora Petrova, Stefan Heimersheim

    Abstract: We identify stable regions in the residual stream of Transformers, where the model's output remains insensitive to small activation changes, but exhibits high sensitivity at region boundaries. These regions emerge during training and become more defined as training progresses or model size increases. The regions appear to be much larger than previously studied polytopes. Our analysis suggests that…

    Submitted 18 November, 2024; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: Presented at the Scientific Methods for Understanding Deep Learning (SciForDL) workshop at NeurIPS 2024

  3. arXiv:2409.15019 [pdf, other]

    cs.LG

    Evaluating Synthetic Activations composed of SAE Latents in GPT-2

    Authors: Giorgi Giglemiani, Nora Petrova, Chatrik Singh Mangat, Jett Janiak, Stefan Heimersheim

    Abstract: Sparse Auto-Encoders (SAEs) are commonly employed in mechanistic interpretability to decompose the residual stream into monosemantic SAE latents. Recent work demonstrates that perturbing a model's activations at an early layer results in a step-function-like change in the model's final layer activations. Furthermore, the model's sensitivity to this perturbation differs between model-generated (rea…

    Submitted 18 November, 2024; v1 submitted 23 September, 2024; originally announced September 2024.

    Comments: Presented at the Attributing Model Behavior at Scale (ATTRIB) workshop at NeurIPS 2024