-
Risk-Sensitive Partially Observable Markov Decision Processes as Fully Observable Multivariate Utility Optimization problems
Authors:
Arsham Afsardeir,
Andreas Kapetanis,
Vaios Laschos,
Klaus Obermayer
Abstract:
We provide a new algorithm for solving Risk Sensitive Partially Observable Markov Decisions Processes, when the risk is modeled by a utility function, and both the state space and the space of observations is finite. This algorithm is based on an observation that the change of measure and the subsequent introduction of the information space that is used for exponential utility functions, can be ac…
▽ More
We provide a new algorithm for solving Risk Sensitive Partially Observable Markov Decisions Processes, when the risk is modeled by a utility function, and both the state space and the space of observations is finite. This algorithm is based on an observation that the change of measure and the subsequent introduction of the information space that is used for exponential utility functions, can be actually extended for sums of exponentials if one introduces an extra vector parameter that tracks the "expected accumulated cost" that corresponds to each exponential. Since every increasing function can be approximated by sums of exponentials in finite intervals, the method can be essentially applied for any utility function, with its complexity depending on the number of exponentials.
△ Less
Submitted 17 July, 2022; v1 submitted 13 August, 2018;
originally announced August 2018.
-
A Fenchel-Moreau-Rockafellar type theorem on the Kantorovich-Wasserstein space with Applications in Partially Observable Markov Decision Processes
Authors:
Vaios Laschos,
Klaus Obermayer,
Yun Shen,
Wilhelm Stannat
Abstract:
By using the fact that the space of all probability measures with finite support can be somehow completed in two different fashions, one generating the Arens-Eells space and another generating the Kantorovich-Wasserstein (Wasserstein-1) space, and by exploiting the duality relationship between the Arens-Eells space with the space of Lipschitz functions, we provide a dual representation of Fenchel-…
▽ More
By using the fact that the space of all probability measures with finite support can be somehow completed in two different fashions, one generating the Arens-Eells space and another generating the Kantorovich-Wasserstein (Wasserstein-1) space, and by exploiting the duality relationship between the Arens-Eells space with the space of Lipschitz functions, we provide a dual representation of Fenchel-Moreau-Rockafellar type for proper convex functionals on Wasserstein-1. We retrieve dual transportation inequalities as a Corollary and we provide examples where the theorem can be used to easily prove dual expressions like the celebrated Donsker-Varadhan variational formula. Finally our result allows to write convex functions as the supremum over all linear functions that are generated by roots of its conjugate dual, something that we apply to the field of Partially observable Markov decision processes (POMDPs) to approximate the value function of a given POMDP by iterating level sets. This extends the method used in Smallwood 1973 for finite state spaces to the case were the state space is a Polish metric space.
△ Less
Submitted 5 May, 2019; v1 submitted 9 March, 2016;
originally announced March 2016.
-
On Average Risk-sensitive Markov Control Processes
Authors:
Yun Shen,
Klaus Obermayer,
Wilhelm Stannat
Abstract:
We introduce the Lyapunov approach to optimal control problems of average risk-sensitive Markov control processes with general risk maps. Motivated by applications in particular to behavioral economics, we consider possibly non-convex risk maps, modeling behavior with mixed risk preference. We introduce classical objective functions to the risk-sensitive setting and we are in particular interested…
▽ More
We introduce the Lyapunov approach to optimal control problems of average risk-sensitive Markov control processes with general risk maps. Motivated by applications in particular to behavioral economics, we consider possibly non-convex risk maps, modeling behavior with mixed risk preference. We introduce classical objective functions to the risk-sensitive setting and we are in particular interested in optimizing the average risk in the infinite-time horizon for Markov Control Processes on general, possibly non-compact, state spaces allowing also unbounded cost. Existence and uniqueness of an optimal control is obtained with a fixed point theorem applied to the nonlinear map modeling the risk-sensitive expected total cost. The necessary contraction is obtained in a suitable chosen seminorm under a new set of conditions: 1) Lyapunov-type conditions on both risk maps and cost functions that control the growth of iterations, and 2) Doeblin-type conditions, known for Markov chains, generalized to nonlinear mappings. In the particular case of the entropic risk map, the above conditions can be replaced by the existence of a Lyapunov function, a local Doeblin-type condition for the underlying Markov chain, and a growth condition on the cost functions.
△ Less
Submitted 22 July, 2015; v1 submitted 13 March, 2014;
originally announced March 2014.
-
Risk-sensitive Markov control processes
Authors:
Yun Shen,
Wilhelm Stannat,
Klaus Obermayer
Abstract:
We introduce a general framework for measuring risk in the context of Markov control processes with risk maps on general Borel spaces that generalize known concepts of risk measures in mathematical finance, operations research and behavioral economics. Within the framework, applying weighted norm spaces to incorporate also unbounded costs, we study two types of infinite-horizon risk-sensitive crit…
▽ More
We introduce a general framework for measuring risk in the context of Markov control processes with risk maps on general Borel spaces that generalize known concepts of risk measures in mathematical finance, operations research and behavioral economics. Within the framework, applying weighted norm spaces to incorporate also unbounded costs, we study two types of infinite-horizon risk-sensitive criteria, discounted total risk and average risk, and solve the associated optimization problems by dynamic programming. For the discounted case, we propose a new discount scheme, which is different from the conventional form but consistent with the existing literature, while for the average risk criterion, we state Lyapunov-like stability conditions that generalize known conditions for Markov chains to ensure the existence of solutions to the optimality equation.
△ Less
Submitted 23 January, 2014; v1 submitted 28 October, 2011;
originally announced October 2011.
-
The Optimal Unbiased Value Estimator and its Relation to LSTD, TD and MC
Authors:
Steffen Grünewälder,
Klaus Obermayer
Abstract:
In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to three well known value estimators: Temporal Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares Temporal Difference Learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov Reward Process (MRP) is acyclic and show that both differ for most…
▽ More
In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to three well known value estimators: Temporal Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares Temporal Difference Learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov Reward Process (MRP) is acyclic and show that both differ for most cyclic MRPs as LSTD is then typically biased. More generally, we show that estimators that fulfill the Bellman equation can only be unbiased for special cyclic MRPs. The main reason being the probability measures with which the expectations are taken. These measure vary from state to state and due to the strong coupling by the Bellman equation it is typically not possible for a set of value estimators to be unbiased with respect to each of these measures. Furthermore, we derive relations of the MVU to MC and TD. The most important one being the equivalence of MC to the MVU and to LSTD for undiscounted MRPs in which MC has the same amount of information. In the discounted case this equivalence does not hold anymore. For TD we show that it is essentially unbiased for acyclic MRPs and biased for cyclic MRPs. We also order estimators according to their risk and present counter-examples to show that no general ordering exists between the MVU and LSTD, between MC and LSTD and between TD and MC. Theoretical results are supported by examples and an empirical evaluation.
△ Less
Submitted 24 August, 2009;
originally announced August 2009.