Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory

Xiao, Jiancong; Shi, Zhekun; Liu, Kaizhao; Long, Qi; Su, Weijie J.

Statistics > Machine Learning

arXiv:2506.12350 (stat)

[Submitted on 14 Jun 2025]

Title:Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory

Authors:Jiancong Xiao, Zhekun Shi, Kaizhao Liu, Qi Long, Weijie J. Su

View PDF HTML (experimental)

Abstract:Despite its empirical success, Reinforcement Learning from Human Feedback (RLHF) has been shown to violate almost all the fundamental axioms in social choice theory -- such as majority consistency, pairwise majority consistency, and Condorcet consistency. This raises a foundational question: why does RLHF perform so well in practice if it fails these seemingly essential properties? In this paper, we resolve this paradox by showing that under mild and empirically plausible assumptions on the preference profile, RLHF does satisfy pairwise majority and Condorcet consistency. These assumptions are frequently satisfied in real-world alignment tasks, offering a theoretical explanation for RLHF's strong practical performance. Furthermore, we show that a slight modification to the reward modeling objective can ensure pairwise majority or Condorcet consistency even under general preference profiles, thereby improving the alignment process. Finally, we go beyond classical axioms in economic and social choice theory and introduce new alignment criteria -- preference matching, preference equivalence, and group preference matching -- that better reflect the goal of learning distributions over responses. We show that while RLHF satisfies the first two properties, it fails to satisfy the third. We conclude by discussing how future alignment methods may be designed to satisfy all three.

Subjects:	Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2506.12350 [stat.ML]
	(or arXiv:2506.12350v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2506.12350

Submission history

From: Jiancong Xiao [view email]
[v1] Sat, 14 Jun 2025 05:14:49 UTC (81 KB)

Statistics > Machine Learning

Title:Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Statistics > Machine Learning

Title:Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators