Nash Learning from Human Feedback

Munos, Rémi; Valko, Michal; Calandriello, Daniele; Azar, Mohammad Gheshlaghi; Rowland, Mark; Guo, Zhaohan Daniel; Tang, Yunhao; Geist, Matthieu; Mesnard, Thomas; Michi, Andrea; Selvi, Marco; Girgin, Sertan; Momchev, Nikola; Bachem, Olivier; Mankowitz, Daniel J.; Precup, Doina; Piot, Bilal


arXiv:2312.00886 (stat)

[Submitted on 1 Dec 2023 (v1), last revised 11 Jun 2024 (this version, v4)]



Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution.
In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF).
In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.

Subjects:	Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Cite as:	arXiv:2312.00886 [stat.ML]
	(or arXiv:2312.00886v4 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2312.00886

Submission history

From: Michal Valko
[v1] Fri, 1 Dec 2023 19:26:23 UTC (264 KB)
[v2] Tue, 5 Dec 2023 11:05:06 UTC (264 KB)
[v3] Wed, 6 Dec 2023 14:07:10 UTC (264 KB)
[v4] Tue, 11 Jun 2024 16:25:52 UTC (822 KB)
