Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision

Jiang, Eric Hanchen; Luo, Haozheng; Pang, Shengyuan; Li, Xiaomin; Qi, Zhenting; Li, Hengli; Yang, Cheng-Fu; Lin, Zongyu; Li, Xinfeng; Xu, Hao; Chang, Kai-Wei; Wu, Ying Nian

arXiv:2505.14999 (cs)

[Submitted on 21 May 2025]

Abstract: Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi-step logical consistency. While Chain-of-Thought (CoT) prompting elicits reasoning steps, it does not guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post hoc verifier. EORM leverages Energy-Based Models (EBMs) to simplify the training of reward models: it learns to assign a scalar energy score to each CoT solution using only outcome labels, thereby avoiding detailed annotations. It does so by interpreting discriminator output logits as negative energies, so that candidates are ranked with lower energy assigned to solutions that lead to correct final outcomes, implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8k, MATH), EORM significantly improves final-answer accuracy (e.g., with Llama 3 8B, achieving 90.7% on GSM8k and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute-force sampling, thereby enhancing the reliability of LLM reasoning outcomes through a streamlined post hoc verification process.
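
The ranking rule described in the abstract (logits read as negative energies, lowest energy preferred) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example candidates, and the logit values are hypothetical, and the scores are assumed to come from some already-trained verifier. The Boltzmann weighting is the standard EBM convention p ∝ exp(-E), included only to show how energies relate to a distribution over candidates.

```python
import math
from typing import List, Sequence


def energies_from_logits(logits: Sequence[float]) -> List[float]:
    """Interpret each verifier logit z as a negative energy: E = -z."""
    return [-z for z in logits]


def select_lowest_energy(candidates: Sequence[str], logits: Sequence[float]) -> str:
    """Return the candidate whose energy is lowest (i.e. whose logit is highest)."""
    if not candidates or len(candidates) != len(logits):
        raise ValueError("need one logit per candidate and at least one candidate")
    energies = energies_from_logits(logits)
    best = min(range(len(candidates)), key=energies.__getitem__)
    return candidates[best]


def boltzmann_weights(energies: Sequence[float]) -> List[float]:
    """Normalized weights proportional to exp(-E), shifted for numerical stability."""
    e_min = min(energies)
    unnorm = [math.exp(-(e - e_min)) for e in energies]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical pool of CoT candidates with made-up verifier logits.
pool = ["solution A", "solution B", "solution C"]
scores = [0.1, 2.0, -1.0]
chosen = select_lowest_energy(pool, scores)
weights = boltzmann_weights(energies_from_logits(scores))
```

Here `chosen` is the candidate with the highest logit, and `weights` sums to one with its largest entry at the same index. Selecting by minimum energy needs only one verifier score per candidate, which is consistent with the abstract's framing of EORM as a post hoc step over a fixed candidate pool.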

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Cite as:	arXiv:2505.14999 [cs.LG]
	(or arXiv:2505.14999v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2505.14999

Submission history

From: Eric Jiang
[v1] Wed, 21 May 2025 01:06:29 UTC (2,947 KB)
