Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Chen, Runjin; Arditi, Andy; Sleight, Henry; Evans, Owain; Lindsey, Jack

arXiv:2507.21509 (cs)

[Submitted on 29 Jul 2025 (v1), last revised 31 Aug 2025 (this version, v2)]

Abstract: Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space, called persona vectors, that underlie several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2507.21509 [cs.CL]
	(or arXiv:2507.21509v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2507.21509

Submission history

From: Runjin Chen
[v1] Tue, 29 Jul 2025 05:20:14 UTC (2,585 KB)
[v2] Sun, 31 Aug 2025 02:41:43 UTC (2,637 KB)
