-
The Structural Safety Generalization Problem
Authors:
Julius Broomfield,
Tom Gibbs,
Ethan Kosak-Hine,
George Ingebretsen,
Tia Nasir,
Jason Zhang,
Reihaneh Iranmanesh,
Sara Pieri,
Reihaneh Rabbany,
Kellin Pelrine
Abstract:
LLM jailbreaks are a widespread safety challenge. Given that this problem has not yet proven tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.
Submitted 30 May, 2025; v1 submitted 13 April, 2025;
originally announced April 2025.
-
Online Influence Campaigns: Strategies and Vulnerabilities
Authors:
Andreea Musulan,
Veronica Xia,
Ethan Kosak-Hine,
Tom Gibbs,
Vidya Sujaya,
Reihaneh Rabbany,
Jean-François Godbout,
Kellin Pelrine
Abstract:
In order to combat the creation and spread of harmful content online, this paper defines and contextualizes the concept of inauthentic, societal-scale manipulation by malicious actors. We review the literature on societally harmful content and how it proliferates to analyze the manipulation strategies used by such actors and the vulnerabilities they target. We also provide an overview of three case studies of extensive manipulation campaigns to emphasize the severity of the problem. We then address the role that Artificial Intelligence plays in the development and dissemination of harmful content, and how its evolution presents new threats to societal cohesion for countries across the globe. Our survey aims to increase our understanding of not just particular aspects of these threats, but also the strategies underlying their deployment, so we can effectively prepare for the evolving cybersecurity landscape.
Submitted 18 December, 2024;
originally announced January 2025.
-
A Simulation System Towards Solving Societal-Scale Manipulation
Authors:
Maximilian Puelma Touzel,
Sneheel Sarangi,
Austin Welch,
Gayatri Krishnakumar,
Dan Zhao,
Zachary Yang,
Hao Yu,
Ethan Kosak-Hine,
Tom Gibbs,
Andreea Musulan,
Camille Thibault,
Busra Tugce Gurbuz,
Reihaneh Rabbany,
Jean-François Godbout,
Kellin Pelrine
Abstract:
The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet, studying these effects in real-world settings at scale is ethically and logistically impractical, highlighting a need for simulation tools that can model these dynamics in controlled settings to enable experimentation with possible defenses. We present a simulation environment designed to address this. We elaborate upon the Concordia framework, which simulates offline, 'real life' activity, by adding online interactions to the simulation through social media with the integration of a Mastodon server. We improve simulation efficiency and information flow, and add a set of measurement tools, particularly longitudinal surveys. We demonstrate the simulator with a tailored example in which we track agents' political positions and show how partisan manipulation of agents can affect election results.
Submitted 16 October, 2024;
originally announced October 2024.
-
Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
Authors:
Tom Gibbs,
Ethan Kosak-Hine,
George Ingebretsen,
Jason Zhang,
Julius Broomfield,
Sara Pieri,
Reihaneh Iranmanesh,
Reihaneh Rabbany,
Kellin Pelrine
Abstract:
Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which are becoming increasingly dangerous as models become increasingly powerful. In this work, we introduce a dataset of jailbreaks where each example can be input in either a single-turn or a multi-turn format. We show that while equivalent in content, the two formats are not equivalent in jailbreak success: defending against one structure does not guarantee defense against the other. Similarly, LLM-based filter guardrails also perform differently depending on not just the input content but the input structure. Thus, vulnerabilities of frontier models should be studied in both single-turn and multi-turn settings; this dataset provides a tool to do so.
Submitted 29 August, 2024;
originally announced September 2024.