Search | arXiv e-print repository

arXiv:2506.01101 [pdf, ps, other]

Learning to optimize convex risk measures: The cases of utility-based shortfall risk and optimized certainty equivalent risk

Authors: Sumedh Gupte, Prashanth L. A., Sanjay P. Bhat

Abstract: We consider the problems of estimation and optimization of two popular convex risk measures: utility-based shortfall risk (UBSR) and Optimized Certainty Equivalent (OCE) risk. We extend these risk measures to cover possibly unbounded random variables. We cover prominent risk measures like the entropic risk, expectile risk, monotone mean-variance risk, Value-at-Risk, and Conditional Value-at-Risk as special cases of either the UBSR or the OCE risk. In the context of estimation, we derive non-asymptotic bounds on the mean absolute error (MAE) and mean-squared error (MSE) of the classical sample average approximation (SAA) estimators of both the UBSR and the OCE. Next, in the context of optimization, we derive expressions for the UBSR gradient and the OCE gradient under a smooth parameterization. Utilizing these expressions, we propose gradient estimators for both the UBSR and the OCE. We use the SAA estimator of UBSR in both these gradient estimators, and derive non-asymptotic bounds on MAE and MSE for the proposed gradient estimation schemes. We incorporate the aforementioned gradient estimators into a stochastic gradient (SG) algorithm for optimization. Finally, we derive non-asymptotic bounds that quantify the rate of convergence of our SG algorithm for the optimization of the UBSR and the OCE risk measure.

Submitted 1 June, 2025; originally announced June 2025.
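
As a rough illustration of the SAA idea mentioned in the abstract of arXiv:2506.01101, the sketch below estimates a UBSR value by bisection on the sample-average constraint. It is not the authors' code. It assumes the convention that X is a loss and that UBSR is the smallest t with E[l(X - t)] <= lambda; the function name `saa_ubsr`, the bracket `[lo, hi]`, and the toy samples are all hypothetical. With exponential loss and lambda = 1 the root has a closed form (the entropic risk), which serves as a sanity check.

```python
import math


def saa_ubsr(samples, loss, lam, lo, hi, iters=200):
    """SAA estimate of UBSR: the root in t of mean(loss(x - t)) = lam.

    Assumes loss is increasing, so the sample mean is decreasing in t,
    and that [lo, hi] brackets the root (both are assumptions of this sketch).
    """
    def g(t):
        return sum(loss(x - t) for x in samples) / len(samples) - lam

    if g(lo) < 0 or g(hi) > 0:
        raise ValueError("[lo, hi] does not bracket the root")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid  # constraint violated at mid, so the root lies to the right
        else:
            hi = mid  # constraint satisfied at mid, so the root is at or left of mid
    return hi


# Illustrative check: exponential loss with lam = 1 recovers the entropic risk.
beta = 1.0
samples = [0.0, 1.0, 2.0]
estimate = saa_ubsr(samples, lambda z: math.exp(beta * z), lam=1.0, lo=-10.0, hi=10.0)
closed_form = math.log(sum(math.exp(beta * x) for x in samples) / len(samples)) / beta
```

Bisection is used here only because the sample-average constraint is monotone in t under the stated assumption; any one-dimensional root finder would do.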

arXiv:2504.20877 [pdf, other]

Preference-centric Bandits: Optimality of Mixtures and Regret-efficient Algorithms

Authors: Meltem Tatlı, Arpan Mukherjee, Prashanth L. A., Karthikeyan Shanmugam, Ali Tajer

Abstract: The objective of canonical multi-armed bandits is to identify and repeatedly select an arm with the largest reward, often in the form of the expected value of the arm's probability distribution. Such a utilitarian perspective and focus on the probability models' first moments, however, is agnostic to the distributions' tail behavior and their implications for variability and risks in decision-making. This paper introduces a principled framework for shifting from expectation-based evaluation to an alternative reward formulation, termed a preference metric (PM). The PMs can place the desired emphasis on different reward realizations and can encode a richer modeling of preferences that incorporates risk aversion, robustness, or other desired attitudes toward uncertainty. A fundamentally distinct observation in such a PM-centric perspective is that designing bandit algorithms will have a significantly different principle: as opposed to the reward-based models in which the optimal sampling policy converges to repeatedly sampling from the single best arm, in the PM-centric framework the optimal policy converges to selecting a mix of arms based on specific mixing weights. Designing such mixture policies departs from the principles for designing bandit algorithms in significant ways, primarily because of uncountable mixture possibilities. The paper formalizes the PM-centric framework and presents two algorithm classes (horizon-dependent and anytime) that learn and track mixtures in a regret-efficient fashion. These algorithms have two distinctions from their canonical counterparts: (i) they involve an estimation routine to form reliable estimates of optimal mixtures, and (ii) they are equipped with tracking mechanisms to navigate arm selection fractions to track the optimal mixtures. These algorithms' regret guarantees are investigated under various algebraic forms of the PMs.

Submitted 30 April, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

Comments: An earlier version of this manuscript, which focused on risk-sensitive bandits, has appeared in the Proceedings of the 2025 International Conference on Artificial Intelligence and Statistics (AISTATS)

arXiv:2503.08896 [pdf, other]

Risk-sensitive Bandits: Arm Mixture Optimality and Regret-efficient Algorithms

Authors: Meltem Tatlı, Arpan Mukherjee, Prashanth L. A., Karthikeyan Shanmugam, Ali Tajer

Abstract: This paper introduces a general framework for risk-sensitive bandits that integrates the notions of risk-sensitive objectives by adopting a rich class of distortion riskmetrics. The introduced framework subsumes the various existing risk-sensitive models. An important and hitherto unknown observation is that for a wide range of riskmetrics, the optimal bandit policy involves selecting a mixture of arms. This is in sharp contrast to the convention in multi-armed bandit algorithms that there is generally a solitary arm that maximizes the utility, whether purely reward-centric or risk-sensitive. This creates a major departure from the principles for designing bandit algorithms since there are uncountable mixture possibilities. The contributions of the paper are as follows: (i) it formalizes a general framework for risk-sensitive bandits, (ii) identifies standard risk-sensitive bandit models for which solitary arm selection is not optimal, and (iii) designs regret-efficient algorithms whose sampling strategies can accurately track optimal arm mixtures (when a mixture is optimal) or the solitary arms (when a solitary arm is optimal). The algorithms are shown to achieve a regret that scales according to $O((\log T/T )^ν)$, where $T$ is the horizon, and $ν>0$ is a riskmetric-specific constant.

Submitted 11 March, 2025; originally announced March 2025.

Comments: AISTATS 2025

arXiv:2409.05733 [pdf, ps, other]

Markov Chain Variance Estimation: A Stochastic Approximation Approach

Authors: Shubhada Agrawal, Prashanth L. A., Siva Theja Maguluri

Abstract: We consider the problem of estimating the asymptotic variance of a function defined on a Markov chain, an important step for statistical inference of the stationary mean. We design a novel recursive estimator that requires $O(1)$ computation at each step, does not require storing any historical samples or any prior knowledge of run-length, and has optimal $O(\frac{1}{n})$ rate of convergence for the mean-squared error (MSE) with provable finite sample guarantees. Here, $n$ refers to the total number of samples generated. Our estimator is based on linear stochastic approximation of an equivalent formulation of the asymptotic variance in terms of the solution of the Poisson equation. We generalize our estimator in several directions, including estimating the covariance matrix for vector-valued functions, estimating the stationary variance of a Markov chain, and approximately estimating the asymptotic variance in settings where the state space of the underlying Markov chain is large. We also show applications of our estimator in average reward reinforcement learning (RL), where we work with asymptotic variance as a risk measure to model safety-critical applications. We design a temporal-difference type algorithm tailored for policy evaluation in this context. We consider both the tabular and linear function approximation settings. Our work paves the way for developing actor-critic style algorithms for variance-constrained RL.

Submitted 22 September, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

Comments: 62 pages, 1 table, added additional references

arXiv:2304.10951 [pdf, ps, other]

A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning

Authors: Mizhaan Prajit Maniyar, Akash Mondal, Prashanth L. A., Shalabh Bhatnagar

Abstract: We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $ε$-SOSP is $O(ε^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(ε^{-4.5})$.

Submitted 21 April, 2023; originally announced April 2023.

arXiv:2210.05918 [pdf, ps, other]

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

Authors: Gandharv Patil, Prashanth L. A., Dheeraj Nagaraj, Doina Precup

Abstract: We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.

Submitted 19 September, 2024; v1 submitted 12 October, 2022; originally announced October 2022.

Journal ref: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 2023
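
To make the tail-averaging idea in the abstract of arXiv:2210.05918 concrete, here is a minimal sketch of TD(0) with linear function approximation that averages only the iterates after an initial burn-in. It is an illustration under stated assumptions, not the paper's algorithm or step-size rule: the function name `td0_tail_average`, the constant step size, the `tail_start` parameter, and the two-state toy chain are all hypothetical.

```python
def td0_tail_average(transitions, features, gamma, step_size, num_features, tail_start):
    """TD(0) with linear function approximation, returning the tail-averaged iterate.

    transitions: list of (state, reward, next_state) tuples along one trajectory.
    features: dict mapping each state to its feature vector (a list of floats).
    tail_start: number of initial iterates to drop before averaging.
    """
    theta = [0.0] * num_features
    tail_sum = [0.0] * num_features
    tail_count = 0
    for t, (s, r, s_next) in enumerate(transitions):
        phi, phi_next = features[s], features[s_next]
        v = sum(a * b for a, b in zip(theta, phi))
        v_next = sum(a * b for a, b in zip(theta, phi_next))
        td_error = r + gamma * v_next - v
        # Semi-gradient TD(0) update along the current feature vector.
        theta = [th + step_size * td_error * p for th, p in zip(theta, phi)]
        if t >= tail_start:
            # Accumulate only the later iterates; early ones carry the initial bias.
            tail_sum = [acc + th for acc, th in zip(tail_sum, theta)]
            tail_count += 1
    return [acc / tail_count for acc in tail_sum]


# Toy deterministic two-state cycle 0 -> 1 -> 0 with one-hot (tabular) features.
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
transitions = [(0, 1.0, 1), (1, 0.0, 0)] * 1000
theta_bar = td0_tail_average(transitions, features, gamma=0.5, step_size=0.5,
                             num_features=2, tail_start=1000)
```

For this toy chain the Bellman equations give values 4/3 and 2/3 for states 0 and 1, so the tail average can be compared against them directly.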

arXiv:2205.05843 [pdf, ps, other]

A Survey of Risk-Aware Multi-Armed Bandits

Authors: Vincent Y. F. Tan, Prashanth L. A., Krishna Jagannathan

Abstract: In several applications such as clinical trials and financial portfolio optimization, the expected value (or the average reward) does not satisfactorily capture the merits of a drug or a portfolio. In such applications, risk plays a crucial role, and a risk-aware performance measure is preferable, so as to capture losses in the case of adverse events. This survey aims to consolidate and summarise the existing research on risk measures, specifically in the context of multi-armed bandits. We review various risk measures of interest, and comment on their properties. Next, we review existing concentration inequalities for various risk measures. Then, we proceed to defining risk-aware bandit problems. We consider algorithms for the regret minimization setting, where the exploration-exploitation trade-off manifests, as well as the best-arm identification setting, which is a pure exploration problem -- both in the context of risk-sensitive measures. We conclude by commenting on persisting challenges and fertile areas for future research.

Submitted 11 May, 2022; originally announced May 2022.

Comments: 11 pages; Unabridged version of a survey paper of the same title accepted to IJCAI-ECAI, 2022

arXiv:2002.11440 [pdf, ps, other]

Non-asymptotic bounds for stochastic optimization with biased noisy gradient oracles

Authors: Nirav Bhavsar, Prashanth L. A.

Abstract: We introduce biased gradient oracles to capture a setting where the function measurements have an estimation error that can be controlled through a batch size parameter. Our proposed oracles are appealing in several practical contexts, for instance, risk measure estimation from a batch of independent and identically distributed (i.i.d.) samples, or simulation optimization, where the function measurements are `biased' due to computational constraints. In either case, increasing the batch size reduces the estimation error. We highlight the applicability of our biased gradient oracles in a risk-sensitive reinforcement learning setting. In the stochastic non-convex optimization context, we analyze a variant of the randomized stochastic gradient (RSG) algorithm with a biased gradient oracle. We quantify the convergence rate of this algorithm by deriving non-asymptotic bounds on its performance. Next, in the stochastic convex optimization setting, we derive non-asymptotic bounds for the last iterate of a stochastic gradient descent (SGD) algorithm with a biased gradient oracle.

Submitted 16 May, 2021; v1 submitted 26 February, 2020; originally announced February 2020.

arXiv:1912.10398 [pdf, other]

Estimation of Spectral Risk Measures

Authors: Ajay Kumar Pandey, Prashanth L. A., Sanjay P. Bhat

Abstract: We consider the problem of estimating a spectral risk measure (SRM) from i.i.d. samples, and propose a novel method that is based on numerical integration. We show that our SRM estimate concentrates exponentially, when the underlying distribution has bounded support. Further, we also consider the case when the underlying distribution is either Gaussian or exponential, and derive a concentration bound for our estimation scheme. We validate the theoretical findings on a synthetic setup, and in a vehicular traffic routing application.

Submitted 22 December, 2019; originally announced December 2019.

arXiv:1902.10709 [pdf, ps, other]

A Wasserstein distance approach for concentration of empirical risk estimates

Authors: Prashanth L. A., Sanjay P. Bhat

Abstract: This paper presents a unified approach based on Wasserstein distance to derive concentration bounds for empirical estimates for two broad classes of risk measures defined in the paper. The classes of risk measures introduced include as special cases well-known risk measures from the finance literature such as conditional value at risk (CVaR), optimized certainty equivalent risk, spectral risk measures, utility-based shortfall risk, cumulative prospect theory (CPT) value, rank dependent expected utility and distorted risk measures. Two estimation schemes are considered, one for each class of risk measures. One estimation scheme involves applying the risk measure to the empirical distribution function formed from a collection of i.i.d. samples of the random variable (r.v.), while the second scheme involves applying the same procedure to a truncated sample. The bounds provided apply to three popular classes of distributions, namely sub-Gaussian, sub-exponential and heavy-tailed distributions. The bounds are derived by first relating the estimation error to the Wasserstein distance between the true and empirical distributions, and then using recent concentration bounds for the latter. Previous concentration bounds are available only for specific risk measures such as CVaR and CPT-value. The bounds derived in this paper are shown to either match or improve upon previous bounds in cases where they are available. The usefulness of the bounds is illustrated through an algorithm and the corresponding regret bound for a stochastic bandit problem involving a general risk measure from each of the two classes introduced in the paper.

Submitted 10 May, 2022; v1 submitted 27 February, 2019; originally announced February 2019.

arXiv:1902.02953 [pdf, ps, other]

Correlated bandits or: How to minimize mean-squared error online

Authors: Vinay Praneeth Boda, Prashanth L. A.

Abstract: While the objective in traditional multi-armed bandit problems is to find the arm with the highest mean, in many settings, finding an arm that best captures information about other arms is of interest. This objective, however, requires learning the underlying correlation structure and not just the means of the arms. Sensor placement for industrial surveillance and cellular network monitoring are a few applications where the underlying correlation structure plays an important role. Motivated by such applications, we formulate the correlated bandit problem, where the objective is to find the arm with the lowest mean-squared error (MSE) in estimating all the arms. To this end, we first derive an MSE estimator, based on sample variances and covariances, and show that our estimator exponentially concentrates around the true MSE. Under a best-arm identification framework, we propose a successive rejects type algorithm and provide bounds on the probability of error in identifying the best arm. Using minimax theory, we also derive fundamental performance limits for the correlated bandit problem.

Submitted 26 June, 2019; v1 submitted 8 February, 2019; originally announced February 2019.

arXiv:1901.00997 [pdf, ps, other]

Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions

Authors: Prashanth L. A., Krishna Jagannathan, Ravi Kumar Kolla

Abstract: Conditional Value-at-Risk (CVaR) is a widely used risk metric in applications such as finance. We derive concentration bounds for CVaR estimates, considering separately the cases of light-tailed and heavy-tailed distributions. In the light-tailed case, we use a classical CVaR estimator based on the empirical distribution constructed from the samples. For heavy-tailed random variables, we assume a mild `bounded moment' condition, and derive a concentration bound for a truncation-based estimator. Notably, our concentration bounds enjoy an exponential decay in the sample size, for heavy-tailed as well as light-tailed distributions. To demonstrate the applicability of our concentration results, we consider a CVaR optimization problem in a multi-armed bandit setting. Specifically, we address the best CVaR-arm identification problem under a fixed budget. We modify the well-known successive rejects algorithm to incorporate a CVaR-based criterion. Using the CVaR concentration result, we derive an upper bound on the probability of incorrect identification by the proposed algorithm.

Submitted 25 August, 2019; v1 submitted 4 January, 2019; originally announced January 2019.

arXiv:1810.09126 [pdf, ps, other]

Risk-Sensitive Reinforcement Learning via Policy Gradient Search

Authors: Prashanth L. A., Michael Fu

Abstract: The objective in a traditional reinforcement learning (RL) problem is to find a policy that optimizes the expected value of a performance metric such as the infinite-horizon cumulative discounted or long-run average cost/reward. In practice, optimizing the expected value alone may not be satisfactory, in that it may be desirable to incorporate the notion of risk into the optimization problem formulation, either in the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., exponential utility, variance, percentile performance, chance constraints, value at risk (quantile), conditional value-at-risk, prospect theory and its later enhancement, cumulative prospect theory. In this book, we consider risk-sensitive RL in two settings: one where the goal is to find a policy that optimizes the usual expected value objective while ensuring that a risk constraint is satisfied, and the other where the risk measure is the objective. We survey some of the recent work in this area specifically where policy gradient search is the solution approach. In the first risk-sensitive RL setting, we cover popular risk measures based on variance, conditional value-at-risk, and chance constraints, and present a template for policy gradient-based risk-sensitive RL algorithms using a Lagrangian formulation. For the setting where risk is incorporated directly into the objective function, we consider an exponential utility formulation, cumulative prospect theory, and coherent risk measures. This non-exhaustive survey aims to give a flavor of the challenges involved in solving risk-sensitive RL problems using policy gradient methods, and to outline some potential future research directions.

Submitted 23 May, 2022; v1 submitted 22 October, 2018; originally announced October 2018.

Comments: To appear in "Foundations and Trends in Machine Learning"

arXiv:1808.01739 [pdf, ps, other]

Concentration bounds for empirical conditional value-at-risk: The unbounded case

Authors: Ravi Kumar Kolla, Prashanth L. A., Sanjay P. Bhat, Krishna Jagannathan

Abstract: In several real-world applications involving decision making under uncertainty, the traditional expected value objective may not be suitable, as it may be necessary to control losses in the case of a rare but extreme event. Conditional Value-at-Risk (CVaR) is a popular risk measure for modeling the aforementioned objective. We consider the problem of estimating CVaR from i.i.d. samples of an unbounded random variable, which is either sub-Gaussian or sub-exponential. We derive a novel one-sided concentration bound for a natural sample-based CVaR estimator in this setting. Our bound relies on a concentration result for a quantile-based estimator for Value-at-Risk (VaR), which may be of independent interest.

Submitted 6 August, 2018; originally announced August 2018.
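
The sample-based estimators referred to in the abstract of arXiv:1808.01739 can be sketched roughly as follows. This is a common textbook form, assumed here for illustration and not taken from the paper: X is treated as a loss, VaR at level alpha is estimated by the ceil(n * alpha)-th order statistic, and CVaR adds the scaled average excess over that quantile. The function name `empirical_var_cvar` and the toy data are hypothetical.

```python
import math


def empirical_var_cvar(samples, alpha):
    """Empirical VaR and CVaR at level alpha in (0, 1) from i.i.d. samples.

    VaR is the ceil(n * alpha)-th order statistic; CVaR adds the average
    excess over VaR, scaled by 1 / (1 - alpha).
    """
    n = len(samples)
    ordered = sorted(samples)
    var = ordered[math.ceil(n * alpha) - 1]
    excess = sum(max(x - var, 0.0) for x in samples)
    cvar = var + excess / (n * (1.0 - alpha))
    return var, cvar


# Toy data: losses 1..8 at level 0.75, so the worst quarter is {7, 8}.
var_hat, cvar_hat = empirical_var_cvar([1, 2, 3, 4, 5, 6, 7, 8], alpha=0.75)
```

On this toy input the quantile estimate is 6 and the CVaR estimate is 6 + (1 + 2) / (8 * 0.25) = 7.5, the average of the two largest losses.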

arXiv:1611.10283 [pdf, ps, other]

Bandit algorithms to emulate human decision making using probabilistic distortions

Authors: Ravi Kumar Kolla, Prashanth L. A., Aditya Gopalan, Krishna Jagannathan, Michael Fu, Steve Marcus

Abstract: Motivated by models of human decision making proposed to explain commonly observed deviations from conventional expected value preferences, we formulate two stochastic multi-armed bandit problems with distorted probabilities on the reward distributions: the classic $K$-armed bandit and the linearly parameterized bandit settings. We consider the aforementioned problems in the regret minimization as well as best arm identification framework for multi-armed bandits. For the regret minimization setting in $K$-armed as well as linear bandit problems, we propose algorithms that are inspired by Upper Confidence Bound (UCB) algorithms, incorporate reward distortions, and exhibit sublinear regret. For the $K$-armed bandit setting, we derive an upper bound on the expected regret for our proposed algorithm, and then we prove a matching lower bound to establish the order-optimality of our algorithm. For the linearly parameterized setting, our algorithm achieves a regret upper bound that is of the same order as that of the standard linear bandit algorithm Optimism in the Face of Uncertainty Linear (OFUL), and unlike OFUL, our algorithm handles distortions and an arm-dependent noise model. For the best arm identification problem in the $K$-armed bandit setting, we propose algorithms, derive guarantees on their performance, and also show that these algorithms are order optimal by proving matching fundamental limits on performance. For best arm identification in linear bandits, we propose an algorithm and establish sample complexity guarantees. Finally, we present simulation experiments which demonstrate the advantages resulting from using distortion-aware learning algorithms in a vehicular traffic routing application.

Submitted 31 October, 2023; v1 submitted 30 November, 2016; originally announced November 2016.

Comments: The material in this paper was presented in part at the 2017 AAAI Conference on Artificial Intelligence

arXiv:1609.07087 [pdf, other]

(Bandit) Convex Optimization with Biased Noisy Gradient Oracles

Authors: Xiaowei Hu, Prashanth L. A., András György, Csaba Szepesvári

Abstract: Algorithms for bandit convex optimization and online learning often rely on constructing noisy gradient estimates, which are then used in appropriately adjusted first-order algorithms, replacing actual gradients. Depending on the properties of the function to be optimized and the nature of ``noise'' in the bandit feedback, the bias and variance of gradient estimates exhibit various tradeoffs. In this paper we propose a novel framework that replaces the specific gradient estimation methods with an abstract oracle. With the help of the new framework we unify previous works, reproducing their results in a clean and concise fashion, while, perhaps more importantly, the framework also allows us to formally show that to achieve the optimal root-$n$ rate either the algorithms that use existing gradient estimators, or the proof techniques used to analyze them have to go beyond what exists today.

Submitted 4 July, 2020; v1 submitted 22 September, 2016; originally announced September 2016.

arXiv:1405.2690 [pdf, ps, other]

Policy Gradients for CVaR-Constrained MDPs

Authors: Prashanth L. A.

Abstract: We study a risk-constrained version of the stochastic shortest path (SSP) problem, where the risk measure considered is Conditional Value-at-Risk (CVaR). We propose two algorithms that obtain a locally risk-optimal policy by employing four tools: stochastic approximation, mini-batches, policy gradients and importance sampling. Both algorithms incorporate a CVaR estimation procedure, along the lines of Bardou et al. [2009], which in turn is based on Rockafellar-Uryasev's representation for CVaR, and utilize the likelihood ratio principle for estimating the gradient of the sum of one cost function (objective of the SSP) and the gradient of the CVaR of the sum of another cost function (in the constraint of SSP). The algorithms differ in the manner in which they approximate the CVaR estimates/necessary gradients - the first algorithm uses stochastic approximation, while the second employs mini-batches in the spirit of Monte Carlo methods. We establish asymptotic convergence of both the algorithms. Further, since estimating CVaR is related to rare-event simulation, we incorporate an importance sampling based variance reduction scheme into our proposed algorithms.

Submitted 12 May, 2014; originally announced May 2014.

arXiv:1403.6530 [pdf, other]

Variance-Constrained Actor-Critic Algorithms for Discounted and Average Reward MDPs

Authors: Prashanth L. A., Mohammad Ghavamzadeh

Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms that operate on three timescales - a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale. In the discounted setting, we point out the difficulty in estimating the gradient of the variance of the return and incorporate simultaneous perturbation approaches to alleviate this. The average setting, on the other hand, allows for an actor update using compatible features to estimate the gradient of the variance. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.

Submitted 18 March, 2015; v1 submitted 25 March, 2014; originally announced March 2014.

arXiv:1307.3176 [pdf, other]

Fast gradient descent for drifting least squares regression, with application to bandits

Authors: Nathaniel Korda, Prashanth L. A., Rémi Munos

Abstract: Online learning algorithms often need to recompute least squares regression estimates of parameters. We study improving the computational complexity of such algorithms by using stochastic gradient descent (SGD) type schemes in place of classic regression solvers. We show that SGD schemes efficiently track the true solutions of the regression problems, even in the presence of a drift. This finding, coupled with an $O(d)$ improvement in complexity, where $d$ is the dimension of the data, makes them attractive for implementation in big data settings. In the case when strong convexity in the regression problem is guaranteed, we provide bounds on the error both in expectation and high probability (the latter is often needed to provide theoretical guarantees for higher level algorithms), despite the drifting least squares solution. As an example of this case we prove that the regret performance of an SGD version of the PEGE linear bandit algorithm [Rusmevichientong and Tsitsiklis 2010] is worse than that of PEGE itself only by a factor of $O(\log^4 n)$. When strong convexity of the regression problem cannot be guaranteed, we investigate using an adaptive regularisation. We make an empirical study of an adaptively regularised, SGD version of LinUCB [Li et al. 2010] in a news article recommendation application, which uses the large scale news recommendation dataset from Yahoo! front page. These experiments show a large gain in computational complexity, with a consistently low tracking error and click-through-rate (CTR) performance that is $75\%$ close.

Submitted 20 November, 2014; v1 submitted 11 July, 2013; originally announced July 2013.

Showing 1–19 of 19 results for author: A., P L