Skip to main content
Cornell University
We gratefully acknowledge support from the Simons Foundation, member institutions, and all contributors. Donate
arxiv logo > cs > arXiv:2509.03728v1

Help | Advanced Search

arXiv logo
Cornell University Logo

quick links

  • Login
  • Help Pages
  • About

Computer Science > Artificial Intelligence

arXiv:2509.03728v1 (cs)
[Submitted on 3 Sep 2025 (this version), latest version 5 Sep 2025 (v2)]

Title:PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

Authors:Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, Leon A Gatys
View a PDF of the paper titled PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming, by Wesley Hanwen Deng and 6 other authors
View PDF
Abstract:Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the "mutation distance" to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Cite as: arXiv:2509.03728 [cs.AI]
  (or arXiv:2509.03728v1 [cs.AI] for this version)
  https://doi.org/10.48550/arXiv.2509.03728
arXiv-issued DOI via DataCite

Submission history

From: Wesley Deng [view email]
[v1] Wed, 3 Sep 2025 21:20:38 UTC (88 KB)
[v2] Fri, 5 Sep 2025 02:25:11 UTC (88 KB)
Full-text links:

Access Paper:

    View a PDF of the paper titled PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming, by Wesley Hanwen Deng and 6 other authors
  • View PDF
  • TeX Source
license icon view license
Current browse context:
cs.AI
< prev   |   next >
new | recent | 2025-09
Change to browse by:
cs
cs.HC

References & Citations

  • NASA ADS
  • Google Scholar
  • Semantic Scholar
export BibTeX citation Loading...

BibTeX formatted citation

×
Data provided by:

Bookmark

BibSonomy logo Reddit logo

Bibliographic and Citation Tools

Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)

Code, Data and Media Associated with this Article

alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)

Demos

Replicate (What is Replicate?)
Hugging Face Spaces (What is Spaces?)
TXYZ.AI (What is TXYZ.AI?)

Recommenders and Search Tools

Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
  • Author
  • Venue
  • Institution
  • Topic

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
  • About
  • Help
  • contact arXivClick here to contact arXiv Contact
  • subscribe to arXiv mailingsClick here to subscribe Subscribe
  • Copyright
  • Privacy Policy
  • Web Accessibility Assistance
  • arXiv Operational Status
    Get status notifications via email or slack