Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

Poonia, Ansh; Jain, Maeghal

arXiv:2507.20936 (cs)

[Submitted on 28 Jul 2025]

Abstract: Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model's reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers not only attend to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model's output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.

Comments:	11 pages
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2507.20936 [cs.LG]
	(or arXiv:2507.20936v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2507.20936

Submission history

From: Ansh Poonia
[v1] Mon, 28 Jul 2025 15:45:31 UTC (722 KB)

