-
Beyond discounted returns: Robust Markov decision processes with average and Blackwell optimality
Authors:
Julien Grand-Clément,
Marek Petrik,
Nicolas Vieille
Abstract:
Robust Markov Decision Processes (RMDPs) are a widely used framework for sequential decision-making under parameter uncertainty. RMDPs have been extensively studied when the objective is to maximize the discounted return, but little is known for average optimality (optimizing the long-run average of the rewards obtained over time) and Blackwell optimality (remaining discount optimal for all discount factors sufficiently close to 1). In this paper, we prove several foundational results for RMDPs beyond the discounted return. We show that average optimal policies can be chosen stationary and deterministic for sa-rectangular RMDPs but, perhaps surprisingly, we show that for s-rectangular RMDPs average optimal policies may not exist, and if they exist, may need to be history-dependent (Markovian). We also study Blackwell optimality for sa-rectangular RMDPs, where we show that $ε$-Blackwell optimal policies always exist, although Blackwell optimal policies may not exist. We also provide a sufficient condition for their existence, which encompasses virtually all examples from the literature. We then discuss the connection between average and Blackwell optimality, and we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages the connections between RMDPs and stochastic games. Overall, our paper emphasizes the superior practical properties of distance-based sa-rectangular models over s-rectangular models for average and Blackwell optimality.
Submitted 14 January, 2025; v1 submitted 6 December, 2023;
originally announced December 2023.
-
On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes
Authors:
Jia Lin Hau,
Erick Delage,
Mohammad Ghavamzadeh,
Marek Petrik
Abstract:
Optimizing static risk-averse objectives in Markov decision processes is difficult because they do not admit standard dynamic programming equations common in Reinforcement Learning (RL) algorithms. Dynamic programming decompositions that augment the state space with discrete risk levels have recently gained popularity in the RL community. Prior work has shown that these decompositions are optimal when the risk level is discretized sufficiently. However, we show that these popular decompositions for Conditional-Value-at-Risk (CVaR) and Entropic-Value-at-Risk (EVaR) are inherently suboptimal regardless of the discretization level. In particular, we show that a saddle point property assumed to hold in prior literature may be violated. However, a decomposition does hold for Value-at-Risk, and our proof demonstrates how this risk measure differs from CVaR and EVaR. Our findings are significant because risk-averse algorithms are used in high-stakes environments, making their correctness much more critical.
Submitted 23 April, 2024; v1 submitted 24 April, 2023;
originally announced April 2023.
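As a small illustration of the static CVaR objective discussed in the abstract above, the following sketch computes the lower-tail CVaR of a finite set of equally likely returns. The function name `cvar`, the reward (rather than cost) convention, and the meaning of the level `alpha` (the fraction of worst outcomes averaged) are assumptions made here for illustration; conventions differ across papers, and this is not the decomposition analyzed by the authors.

```python
def cvar(samples, alpha):
    """Lower-tail CVaR at level alpha in (0, 1] of equally likely reward samples.

    Assumed convention: the mean of the worst alpha-fraction of outcomes,
    so alpha = 1 gives the plain mean and small alpha is more risk-averse.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    xs = sorted(samples)              # worst (smallest) rewards first
    budget = alpha * len(xs)          # tail mass, measured in samples
    total = 0.0
    for x in xs:
        weight = min(1.0, budget)     # the last tail sample may count partially
        total += weight * x
        budget -= weight
        if budget <= 0.0:
            break
    return total / (alpha * len(xs))
```

For example, with returns `[1, 2, 3, 4]`, level `alpha = 0.5` averages the two worst outcomes, while `alpha = 1.0` recovers the mean.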
-
On the convex formulations of robust Markov decision processes
Authors:
Julien Grand-Clément,
Marek Petrik
Abstract:
Robust Markov decision processes (RMDPs) are used for applications of dynamic optimization in uncertain environments and have been studied extensively. Many of the main properties and algorithms of MDPs, such as value iteration and policy iteration, extend directly to RMDPs. Surprisingly, there is no known analog of the MDP convex optimization formulation for solving RMDPs. This work describes the first convex optimization formulation of RMDPs under the classical sa-rectangularity and s-rectangularity assumptions. By using entropic regularization and exponential change of variables, we derive a convex formulation with a number of variables and constraints polynomial in the number of states and actions, but with large coefficients in the constraints. We further simplify the formulation for RMDPs with polyhedral, ellipsoidal, or entropy-based uncertainty sets, showing that, in these cases, RMDPs can be reformulated as conic programs based on exponential cones, quadratic cones, and non-negative orthants. Our work opens a new research direction for RMDPs and can serve as a first step toward obtaining a tractable convex formulation of RMDPs.
Submitted 13 December, 2023; v1 submitted 21 September, 2022;
originally announced September 2022.
-
Robust Phi-Divergence MDPs
Authors:
Chin Pang Ho,
Marek Petrik,
Wolfram Wiesemann
Abstract:
In recent years, robust Markov decision processes (MDPs) have emerged as a prominent modeling framework for dynamic decision problems affected by uncertainty. In contrast to classical MDPs, which only account for stochasticity by modeling the dynamics through a stochastic process with a known transition kernel, robust MDPs additionally account for ambiguity by optimizing in view of the most adverse transition kernel from a prescribed ambiguity set. In this paper, we develop a novel solution framework for robust MDPs with s-rectangular ambiguity sets that decomposes the problem into a sequence of robust Bellman updates and simplex projections. Exploiting the rich structure present in the simplex projections corresponding to phi-divergence ambiguity sets, we show that the associated s-rectangular robust MDPs can be solved substantially faster than with state-of-the-art commercial solvers as well as a recent first-order solution scheme, thus rendering them attractive alternatives to classical MDPs in practical applications.
Submitted 12 January, 2023; v1 submitted 27 May, 2022;
originally announced May 2022.
-
Soft-Robust Algorithms for Batch Reinforcement Learning
Authors:
Elita A. Lobo,
Mohammad Ghavamzadeh,
Marek Petrik
Abstract:
In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion, which minimizes the probability of a catastrophic failure. Unfortunately, such policies are typically overly conservative as the percentile criterion is non-convex, difficult to optimize, and ignores the mean performance. To overcome these shortcomings, we study the soft-robust criterion, which uses risk measures to better balance the mean and percentile criteria. In this paper, we establish the soft-robust criterion's fundamental properties, show that it is NP-hard to optimize, and propose and analyze two algorithms to approximately optimize it. Our theoretical analyses and empirical evaluations demonstrate that our algorithms compute much less conservative solutions than the existing approximate methods for optimizing the percentile criterion.
Submitted 26 February, 2021; v1 submitted 29 November, 2020;
originally announced November 2020.
-
Entropic Risk Constrained Soft-Robust Policy Optimization
Authors:
Reazul Hasan Russel,
Bahram Behzadian,
Marek Petrik
Abstract:
Having a perfect model to compute the optimal policy is often infeasible in reinforcement learning. It is important in high-stakes domains to quantify and manage risk induced by model uncertainties. The entropic risk measure is an exponential utility-based convex risk measure that satisfies many reasonable properties. In this paper, we propose entropic risk constrained policy gradient and actor-critic algorithms that are risk-averse to the model uncertainty. We demonstrate the usefulness of our algorithms on several problem domains.
Submitted 20 June, 2020;
originally announced June 2020.
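To make the exponential-utility definition mentioned in the abstract above concrete, here is a minimal sketch of the entropic risk of a discrete random reward, using the common form -(1/beta) * log E[exp(-beta * X)]. The function name `entropic_risk`, the reward convention, and the parameter name `beta` are assumptions for illustration only; the abstract does not specify the authors' notation, and this is not their policy gradient or actor-critic algorithm.

```python
import math


def entropic_risk(values, probs, beta):
    """Entropic risk of a discrete reward X: -(1/beta) * log E[exp(-beta * X)].

    Assumed convention: beta > 0 is the risk-aversion level; larger beta
    weights bad outcomes more, and beta -> 0 approaches the expectation.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Keep only outcomes with positive probability.
    pairs = [(p, -beta * x) for x, p in zip(values, probs) if p > 0]
    # Factor out the largest exponent (log-sum-exp) for numerical stability.
    m = max(e for _, e in pairs)
    total = sum(p * math.exp(e - m) for p, e in pairs)
    return -(m + math.log(total)) / beta
```

For a constant reward the measure returns that constant, and for a fair coin flip between 0 and 10 it returns a value below the mean of 5, reflecting risk aversion.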
-
Partial Policy Iteration for L1-Robust Markov Decision Processes
Authors:
Chin Pang Ho,
Marek Petrik,
Wolfram Wiesemann
Abstract:
Robust Markov decision processes (MDPs) make it possible to compute reliable solutions for dynamic decision problems whose evolution is modeled by rewards and partially-known transition probabilities. Unfortunately, accounting for uncertainty in the transition probabilities significantly increases the computational complexity of solving robust MDPs, which severely limits their scalability. This paper describes new efficient algorithms for solving the common class of robust MDPs with s- and sa-rectangular ambiguity sets defined by weighted $L_1$ norms. We propose partial policy iteration, a new, efficient, flexible, and general policy iteration scheme for robust MDPs. We also propose fast methods for computing the robust Bellman operator in quasi-linear time, nearly matching the linear complexity of the non-robust Bellman operator. Our experimental results indicate that the proposed methods are many orders of magnitude faster than the state-of-the-art approach, which uses linear programming solvers combined with robust value iteration.
Submitted 16 June, 2020;
originally announced June 2020.
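The inner problem of an sa-rectangular robust Bellman update, as described in the abstract above, asks for the worst transition distribution within an $L_1$ ball around a nominal one. The sketch below shows the well-known sort-based greedy solution for the unweighted case, which runs in O(n log n) time. It is an illustrative simplification and not the authors' algorithm: the paper treats weighted norms and s-rectangular sets, and the names `worst_case_l1`, `p_bar`, and `kappa` are assumptions introduced here.

```python
def worst_case_l1(z, p_bar, kappa):
    """Minimize p . z over distributions p with ||p - p_bar||_1 <= kappa.

    z      : value of each next state
    p_bar  : nominal transition probabilities (non-negative, sum to 1)
    kappa  : radius of the unweighted L1 ball (assumed non-negative)
    Returns the pair (worst-case value, worst-case distribution).
    """
    n = len(z)
    p = list(p_bar)
    i_min = min(range(n), key=lambda i: z[i])
    # Shifting mass eps changes the L1 distance by 2 * eps.
    eps = min(kappa / 2.0, 1.0 - p[i_min])
    p[i_min] += eps
    # Remove the same mass from the highest-value states first.
    remaining = eps
    for i in sorted(range(n), key=lambda i: z[i], reverse=True):
        if remaining <= 0.0:
            break
        if i == i_min:
            continue
        delta = min(remaining, p[i])
        p[i] -= delta
        remaining -= delta
    value = sum(pi * zi for pi, zi in zip(p, z))
    return value, p
```

With values `[1, 2, 3]`, nominal probabilities `[0.2, 0.3, 0.5]`, and `kappa = 0.4`, the adversary moves 0.2 of probability mass from the best state to the worst one, lowering the expected value from 2.3 to 1.9. A robust Q-value for one state-action pair could then be formed as the reward plus the discount factor times this worst-case value.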
-
Robust Policy Optimization with Baseline Guarantees
Authors:
Yinlam Chow,
Marek Petrik,
Mohammad Ghavamzadeh
Abstract:
Our goal is to compute a policy that guarantees improved return over a baseline policy even when the available MDP model is inaccurate. The inaccurate model may be constructed, for example, by system identification techniques when the true model is inaccessible. When the modeling error is large, the standard solution to the constructed model has no performance guarantees with respect to the true model. In this paper, we develop algorithms that provide such performance guarantees and show a trade-off between their complexity and conservatism. Our novel model-based safe policy search algorithms leverage recent advances in robust optimization techniques. Furthermore, we illustrate the effectiveness of these algorithms using a numerical example.
Submitted 15 June, 2015; v1 submitted 15 June, 2015;
originally announced June 2015.
-
Tight Approximations of Dynamic Risk Measures
Authors:
Dan A. Iancu,
Marek Petrik,
Dharmashankar Subramanian
Abstract:
This paper compares two different frameworks recently introduced in the literature for measuring risk in a multi-period setting. The first corresponds to applying a single coherent risk measure to the cumulative future costs, while the second involves applying a composition of one-step coherent risk mappings. We summarize the relative strengths of the two methods, characterize several necessary and sufficient conditions under which one of the measurements always dominates the other, and introduce a metric to quantify how close the two risk measures are.
Using this notion, we address the question of how tightly a given coherent measure can be approximated by lower or upper-bounding compositional measures. We exhibit an interesting asymmetry between the two cases: the tightest possible upper-bound can be exactly characterized, and corresponds to a popular construction in the literature, while the tightest-possible lower bound is not readily available. We show that testing domination and computing the approximation factors is generally NP-hard, even when the risk measures in question are comonotonic and law-invariant. However, we characterize conditions and discuss several examples where polynomial-time algorithms are possible. One such case is the well-known Conditional Value-at-Risk measure, which is further explored in our companion paper [Huang, Iancu, Petrik and Subramanian, "Static and Dynamic Conditional Value at Risk" (2012)]. Our theoretical and algorithmic constructions exploit interesting connections between the study of risk measures and the theory of submodularity and combinatorial optimization, which may be of independent interest.
Submitted 23 August, 2013; v1 submitted 29 June, 2011;
originally announced June 2011.