$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

Zhou, Jin Peng; Wang, Kaiwen; Chang, Jonathan; Gao, Zhaolin; Kallus, Nathan; Weinberger, Kilian Q.; Brantley, Kianté; Sun, Wen

arXiv:2502.20548 (cs)

[Submitted on 27 Feb 2025]

Abstract: Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at this https URL.
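
As a rough illustration of what "guiding the reference policy" with a regularized $Q$ function can look like, recall the standard closed form for KL-regularized RL: the optimal policy satisfies $\pi^\star(a \mid s) \propto \pi_{\text{ref}}(a \mid s)\exp(Q^\star(s,a)/\eta)$, where $\eta$ is the KL coefficient. The sketch below applies that reweighting to a single toy next-token distribution. It is not the authors' implementation: the function names `guided_policy` and `kl_divergence`, the parameter `eta`, and the numbers are all hypothetical, and the $Q$-values are simply given rather than learned with distributional RL.

```python
import math


def guided_policy(ref_logprobs, q_values, eta):
    """Reweight a reference distribution by exp(Q / eta) and renormalize.

    ref_logprobs: log pi_ref(a | s) for each candidate action (token).
    q_values:     assumed (hypothetical) regularized Q estimates per action.
    eta:          KL-regularization strength; larger eta stays closer to pi_ref.
    """
    if eta <= 0:
        raise ValueError("eta must be positive")
    # Unnormalized log-probabilities: log pi_ref(a|s) + Q(s, a) / eta.
    scores = [lp + q / eta for lp, q in zip(ref_logprobs, q_values)]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy example: three candidate tokens, the second has the highest Q-value.
ref = [0.5, 0.3, 0.2]
ref_logprobs = [math.log(x) for x in ref]
q = [0.0, 1.0, 0.0]

pi_strong = guided_policy(ref_logprobs, q, eta=0.5)  # weak regularization
pi_weak = guided_policy(ref_logprobs, q, eta=5.0)    # strong regularization
```

In this toy setting, a smaller `eta` moves more probability mass toward the high-$Q$ token and a larger `eta` keeps the guided distribution closer to the reference, which is the trade-off the KL-regularized objective controls.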

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2502.20548 [cs.LG]
	(or arXiv:2502.20548v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.20548

Submission history

From: Jin Peng Zhou
[v1] Thu, 27 Feb 2025 21:43:00 UTC (315 KB)
