RewardBench: Evaluating Reward Models for Language Modeling

Lambert, Nathan; Pyatkin, Valentina; Morrison, Jacob; Miranda, LJ; Lin, Bill Yuchen; Chandu, Khyathi; Dziri, Nouha; Kumar, Sachin; Zick, Tom; Choi, Yejin; Smith, Noah A.; Hajishirzi, Hannaneh

arXiv:2403.13787 (cs)

[Submitted on 20 Mar 2024 (v1), last revised 8 Jun 2024 (this version, v2)]

Abstract: Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training and understanding are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and code-base for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We create specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.
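
As a rough illustration of how prompt-chosen-rejected trios can be used for evaluation, the sketch below counts how often a scoring function rates the chosen response above the rejected one. This is a minimal sketch under stated assumptions, not the RewardBench code-base: the abstract does not spell out the scoring rule, so the pairwise-accuracy metric, the `Trio` type, the `pairwise_accuracy` function, the toy scorer, and the example data are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Trio:
    """One evaluation item: a prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_accuracy(
    score: Callable[[str, str], float], trios: Iterable[Trio]
) -> float:
    """Fraction of trios where score(prompt, chosen) > score(prompt, rejected).

    Ties count as misses (an assumption made for this sketch).
    """
    items = list(trios)
    if not items:
        raise ValueError("need at least one trio")
    wins = sum(
        score(t.prompt, t.chosen) > score(t.prompt, t.rejected) for t in items
    )
    return wins / len(items)


def toy_length_score(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a reward model: it simply prefers longer text.
    return float(len(response))


# Made-up example data, not taken from the RewardBench dataset.
EXAMPLE_TRIOS = [
    Trio("What is 2 + 2?", "2 + 2 equals 4.", "5"),
    Trio("Name a prime number.", "7", "9 is a prime number."),
]

if __name__ == "__main__":
    print(pairwise_accuracy(toy_length_score, EXAMPLE_TRIOS))
```

The length-based scorer gets the first trio right and the second wrong, which hints at why trios with subtle but verifiable differences are useful: a scorer relying on a surface feature such as length cannot separate them reliably.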

Comments:	44 pages, 19 figures, 12 tables
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2403.13787 [cs.LG]
	(or arXiv:2403.13787v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2403.13787

Submission history

From: Nathan Lambert
[v1] Wed, 20 Mar 2024 17:49:54 UTC (493 KB)
[v2] Sat, 8 Jun 2024 16:40:12 UTC (502 KB)
