Showing 1–4 of 4 results for author: Antolín, N P

Searching in archive cs.
  1. arXiv:2505.00509  [pdf, ps, other]

    cs.LG

    Self-Ablating Transformers: More Interpretability, Less Sparsity

    Authors: Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin

    Abstract: A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach dynamically enforces a k-winner-takes-all constraint, forcing the model to demonstrate selective activation across neuron and attention units. Unlike post-hoc methods…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: Poster Presentation at Building Trust Workshop at ICLR 2025

  2. arXiv:2504.07831  [pdf, other]

    cs.AI cs.CL

    Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

    Authors: Simon Lermen, Mateusz Dziemian, Natalia Pérez-Campanero Antolín

    Abstract: We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innoc…

    Submitted 10 April, 2025; originally announced April 2025.

  3. arXiv:2503.12722  [pdf, other]

    cs.AI cs.CL cs.GT cs.MA

    Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

    Authors: Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh "Jord" Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín

    Abstract: As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi-agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod's Iterated Prisoner's Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five tr…

    Submitted 16 March, 2025; originally announced March 2025.

    Comments: Poster, Technical AI Safety Conference 2025

  4. arXiv:2411.13627  [pdf, other]

    cs.CR cs.AI cs.SC

    CryptoFormalEval: Integrating LLMs and Formal Verification for Automated Cryptographic Protocol Vulnerability Detection

    Authors: Cristian Curaba, Denis D'Ambrosi, Alessandro Minisini, Natalia Pérez-Campanero Antolín

    Abstract: Cryptographic protocols play a fundamental role in securing modern digital infrastructure, but they are often deployed without prior formal verification. This could lead to the adoption of distributed systems vulnerable to attack vectors. Formal verification methods, on the other hand, require complex and time-consuming techniques that lack automatization. In this paper, we introduce a benchmark t…

    Submitted 20 November, 2024; originally announced November 2024.