Reward Model Interpretability via Optimal and Pessimal Tokens

Christian, Brian; Kirk, Hannah Rose; Thompson, Jessica A. F.; Summerfield, Christopher; Dumbalska, Tsvetomira

doi:10.1145/3715275.3732068

arXiv:2506.07326 (cs)

[Submitted on 8 Jun 2025]

Title: Reward Model Interpretability via Optimal and Pessimal Tokens

Authors: Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska

Abstract: Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
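
As a rough illustration of the exhaustive single-token analysis the abstract describes, the following minimal sketch scores every token in a vocabulary as a one-token response to a prompt and returns the highest- and lowest-scoring tokens. It is not the authors' code: the function name `rank_single_token_responses`, the toy vocabulary, the scores in `TOY_SCORES`, and the example prompt are all hypothetical stand-ins, and a real analysis would replace `toy_score` with a call to an actual reward model over its full tokenizer vocabulary.

```python
from typing import Callable, List, Tuple

ScoredToken = Tuple[str, float]


def rank_single_token_responses(
    prompt: str,
    vocabulary: List[str],
    score: Callable[[str, str], float],
    k: int = 3,
) -> Tuple[List[ScoredToken], List[ScoredToken]]:
    """Score every token as a one-token response to `prompt`.

    Returns (optimal, pessimal): the k highest-scoring tokens (best first)
    and the k lowest-scoring tokens (worst first).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Exhaustive pass: one reward-model call per vocabulary entry.
    scored = [(token, score(prompt, token)) for token in vocabulary]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    optimal = scored[:k]
    pessimal = scored[-k:][::-1]  # reverse so the lowest score comes first
    return optimal, pessimal


# Hypothetical scores standing in for a reward model's scalar outputs.
TOY_SCORES = {
    "love": 2.0,
    "joy": 1.5,
    "chair": 0.1,
    "the": 0.0,
    "pain": -1.2,
    "hate": -2.5,
}


def toy_score(prompt: str, token: str) -> float:
    """Assumed stand-in for reward_model(prompt, response) -> scalar."""
    return TOY_SCORES.get(token, 0.0)


# Illustrative value-laden prompt (an assumption, chosen only for the demo).
optimal, pessimal = rank_single_token_responses(
    "In one word, what is the best thing in the world?",
    list(TOY_SCORES),
    toy_score,
    k=2,
)
print("optimal:", optimal)
print("pessimal:", pessimal)
```

Because the pass is exhaustive, its cost grows linearly with vocabulary size, so with a real reward model the scoring calls would typically be batched.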

Comments:	Accepted for publication in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), to appear June 2025
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
ACM classes:	I.2.6; I.2.7; H.5.2; J.4; K.4.2
Cite as:	arXiv:2506.07326 [cs.CL]
	(or arXiv:2506.07326v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2506.07326
Related DOI:	https://doi.org/10.1145/3715275.3732068

Submission history

From: Brian Christian
[v1] Sun, 8 Jun 2025 23:56:58 UTC (5,267 KB)
