On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

Corrado, Nicholas E.; Hanna, Josiah P.

arXiv:2311.08290 (cs)

[Submitted on 14 Nov 2023 (v1), last revised 6 Oct 2024 (this version, v2)]

Abstract: On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling can produce (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms.
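
The notion of sampling error in the abstract can be illustrated with a toy example. The sketch below is not PROPS and is not taken from the paper: it assumes a single state with three discrete actions, measures sampling error as the total variation distance between the empirical action frequencies and the policy, and swaps in a simple greedy rule (always take the most under-sampled action) for a learned behavior policy. All function names are hypothetical.

```python
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def empirical_distribution(counts):
    """Normalize action counts into empirical action frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def sample_on_policy(pi, n, rng):
    """Draw n i.i.d. actions from the policy pi and return action counts."""
    counts = [0] * len(pi)
    for _ in range(n):
        action = rng.choices(range(len(pi)), weights=pi)[0]
        counts[action] += 1
    return counts


def sample_adaptive(pi, n):
    """Non-i.i.d. sampling (illustrative assumption, not the paper's method):
    at each step, take the action whose observed count falls furthest below
    its expected count under pi."""
    counts = [0] * len(pi)
    for t in range(n):
        deficits = [(t + 1) * p - c for p, c in zip(pi, counts)]
        action = max(range(len(pi)), key=lambda i: deficits[i])
        counts[action] += 1
    return counts


if __name__ == "__main__":
    pi = [0.6, 0.3, 0.1]  # hypothetical current policy over three actions
    n = 10
    rng = random.Random(0)
    on_policy = sample_on_policy(pi, n, rng)
    adaptive = sample_adaptive(pi, n)
    print("on-policy error:", total_variation(empirical_distribution(on_policy), pi))
    print("adaptive error: ", total_variation(empirical_distribution(adaptive), pi))
```

With only ten samples, i.i.d. draws from the policy frequently over- or under-represent some action, while the adaptive rule keeps the empirical frequencies close to the policy at every step. The paper's method addresses the harder setting of multi-state tasks and continuous actions, where such a greedy rule does not directly apply.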

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2311.08290 [cs.LG]
(or arXiv:2311.08290v2 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2311.08290

Submission history

From: Nicholas Corrado
[v1] Tue, 14 Nov 2023 16:37:28 UTC (24,262 KB)
[v2] Sun, 6 Oct 2024 23:33:45 UTC (33,026 KB)