Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent

Bolland, Adrien; Boukas, Ioannis; Berger, Mathias; Ernst, Damien

arXiv:2006.01738 (cs)

[Submitted on 2 Jun 2020 (v1), last revised 6 Jan 2022 (this version, v4)]

Abstract: We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, assumed known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well as this benchmark in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
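
The loop described in the abstract (sample trajectories, estimate the gradient of the expected return, take a projected ascent step on both environment and policy parameters) can be illustrated on a toy problem. The sketch below is not the authors' DEPS implementation. It assumes a hypothetical scalar system x' = x + psi*u + w with a deterministic linear policy u = -theta*x, a reward -(x^2 + RHO*u^2) - C*psi, and box constraints on both parameters. All names and constants are invented for illustration, and the per-trajectory gradients are derived by hand through forward sensitivities rather than obtained by automatic differentiation.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
T = 10                     # time horizon
SIGMA = 0.1                # standard deviation of the additive disturbance w
RHO = 0.1                  # weight of the action penalty in the reward
C = 0.01                   # per-step design cost coefficient
PSI_BOUNDS = (0.0, 1.0)    # feasible set of the environment (design) parameter
THETA_BOUNDS = (0.0, 2.0)  # feasible set of the policy parameter


def rollout(psi, theta, rng):
    """Simulate one trajectory; return (return, d return/d psi, d return/d theta)."""
    x, dx_dpsi, dx_dtheta = 1.0, 0.0, 0.0
    ret = g_psi = g_theta = 0.0
    k = 1.0 + RHO * theta ** 2   # r = -k*x^2 - C*psi once u = -theta*x is substituted
    a = 1.0 - psi * theta        # closed-loop dynamics: x' = a*x + w
    for _ in range(T):
        ret += -k * x * x - C * psi
        g_psi += -2.0 * k * x * dx_dpsi - C
        g_theta += -2.0 * k * x * dx_dtheta - 2.0 * RHO * theta * x * x
        w = rng.gauss(0.0, SIGMA)
        # Forward sensitivities of the next state (they use the current x).
        dx_dpsi, dx_dtheta = a * dx_dpsi - theta * x, a * dx_dtheta - psi * x
        x = a * x + w
    return ret, g_psi, g_theta


def estimate(psi, theta, n, rng):
    """Monte-Carlo averages of the return and of its gradient over n trajectories."""
    totals = [0.0, 0.0, 0.0]
    for _ in range(n):
        for i, value in enumerate(rollout(psi, theta, rng)):
            totals[i] += value
    return tuple(t / n for t in totals)


def project(value, bounds):
    """Euclidean projection of a scalar onto an interval (box constraint)."""
    lo, hi = bounds
    return min(max(value, lo), hi)


def optimize(psi=0.2, theta=0.2, iters=300, batch=16, lr=0.005, seed=0):
    """Projected stochastic gradient ascent on (psi, theta) jointly."""
    rng = random.Random(seed)
    for _ in range(iters):
        _, g_psi, g_theta = estimate(psi, theta, batch, rng)
        psi = project(psi + lr * g_psi, PSI_BOUNDS)
        theta = project(theta + lr * g_theta, THETA_BOUNDS)
    return psi, theta
```

Calling `optimize()` returns a pair `(psi, theta)` inside the two boxes, and `estimate(psi, theta, 2000, random.Random(1))[0]` gives a Monte-Carlo estimate of the corresponding expected return. In a setting closer to the paper, the hand-written sensitivities would presumably be replaced by automatic differentiation and the linear policy by a neural network.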

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2006.01738 [cs.LG]
	(or arXiv:2006.01738v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2006.01738
Journal reference:	Journal of Artificial Intelligence Research 73 (2022) 117-171

Submission history

From: Adrien Bolland
[v1] Tue, 2 Jun 2020 16:08:07 UTC (1,231 KB)
[v2] Wed, 23 Dec 2020 20:51:10 UTC (919 KB)
[v3] Wed, 22 Sep 2021 12:50:53 UTC (5,534 KB)
[v4] Thu, 6 Jan 2022 12:25:26 UTC (5,535 KB)
