A General Theoretical Paradigm to Understand Learning from Human Preferences

Azar, Mohammad Gheshlaghi; Rowland, Mark; Piot, Bilal; Guo, Daniel; Calandriello, Daniele; Valko, Michal; Munos, Rémi

arXiv:2310.12036 (cs)

[Submitted on 18 Oct 2023 (v1), last revised 22 Nov 2023 (this version, v2)]

Abstract: The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data without the reward modelling stage. However, this method still relies heavily on the first approximation.
In this paper, we try to gain a deeper theoretical understanding of these practical algorithms. In particular, we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case of $\Psi$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.
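
The contrast between DPO and the Identity special case can be illustrated with a minimal Python sketch. The abstract does not state the loss formulas, so the per-pair forms below are assumptions: they follow how the two objectives are usually written for a single preferred/dispreferred pair, with h the difference of policy-to-reference log-ratios. The function names and the parameters `beta` and `tau` are illustrative and are not taken from this page.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """h = log(pi(y_w)/pi_ref(y_w)) - log(pi(y_l)/pi_ref(y_l)).

    Inputs are sequence log-probabilities of the preferred (w) and
    dispreferred (l) completions under the policy and the reference.
    """
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(h, beta):
    """Assumed DPO per-pair loss: -log(sigmoid(beta * h)).

    Written in a numerically stable softplus form.
    """
    z = beta * h
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def ipo_loss(h, tau):
    """Assumed per-pair loss for Psi = Identity: (h - 1/(2*tau))**2.

    The squared error pulls the margin h toward the finite target 1/(2*tau).
    """
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    tau = 0.5  # illustrative regularisation strength; target margin is 1.0
    for h in (0.0, 1.0, 5.0, 20.0):
        print(f"h={h:5.1f}  dpo={dpo_loss(h, beta=tau):.6f}  ipo={ipo_loss(h, tau):.3f}")
```

Under these assumptions, the logistic loss keeps decreasing as the margin h grows, so it has no finite minimiser on a single pair, whereas the squared loss is minimised at h = 1/(2τ) and penalises margins beyond it. This is one way to read the "potential pitfalls" that the abstract refers to.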

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2310.12036 [cs.AI]
	(or arXiv:2310.12036v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2310.12036

Submission history

From: Mohammad Gheshlaghi Azar
[v1] Wed, 18 Oct 2023 15:21:28 UTC (689 KB)
[v2] Wed, 22 Nov 2023 00:02:49 UTC (689 KB)
