A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits

Authors: Joel Q. L. Chang, Vincent Y. F. Tan

Abstract: This paper unifies the design and the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem for a class of risk functionals $ρ$ that are continuous and dominant. We prove generalised concentration bounds for these continuous and dominant risk functionals and show that a wide class of popular risk functionals belong to this class. Using our newly developed analytical toolkits, we analyse the algorithm $ρ$-MTS (for multinomial distributions) and prove that it admits asymptotically optimal regret bounds of risk-averse algorithms under CVaR, proportional hazard, and other ubiquitous risk measures. More generally, we prove the asymptotic optimality of $ρ$-MTS for Bernoulli distributions for a class of risk measures known as empirical distribution performance measures (EDPMs); this includes the well-known mean-variance. Numerical simulations show that the regret bounds incurred by our algorithms are reasonably tight vis-à-vis algorithm-independent lower bounds.

Submitted 17 April, 2022; v1 submitted 25 August, 2021; originally announced August 2021.

Comments: Accepted to the Association for the Advancement of Artificial Intelligence (AAAI) 2022

arXiv:2105.06960 [pdf, ps, other]

Thompson Sampling for Gaussian Entropic Risk Bandits

Authors: Ming Liang Ang, Eloise Y. Y. Lim, Joel Q. L. Chang

Abstract: The multi-armed bandit (MAB) problem is a ubiquitous decision-making problem that exemplifies the exploration-exploitation tradeoff. Standard formulations exclude risk in decision making. Risk notably complicates the basic reward-maximising objectives, in part because there is no universally agreed definition of it. In this paper, we consider an entropic risk (ER) measure and explore the performance of a Thompson sampling-based algorithm ERTS under this risk measure by providing regret bounds for ERTS and corresponding instance-dependent lower bounds.

Submitted 14 May, 2021; originally announced May 2021.

Comments: arXiv admin note: text overlap with arXiv:2011.08046

arXiv:2011.08046 [pdf, other]

Risk-Constrained Thompson Sampling for CVaR Bandits

Authors: Joel Q. L. Chang, Qiuyu Zhu, Vincent Y. F. Tan

Abstract: The multi-armed bandit (MAB) problem is a ubiquitous decision-making problem that exemplifies the exploration-exploitation tradeoff. Standard formulations exclude risk in decision making. Risk notably complicates the basic reward-maximising objective, in part because there is no universally agreed definition of it. In this paper, we consider a popular risk measure in quantitative finance known as the Conditional Value at Risk (CVaR). We explore the performance of a Thompson Sampling-based algorithm CVaR-TS under this risk measure. We provide comprehensive comparisons of our regret bounds with those of state-of-the-art L/UCB-based algorithms in comparable settings and demonstrate their clear improvement in performance. We also include numerical simulations to empirically verify that CVaR-TS outperforms other L/UCB-based algorithms.

Submitted 4 February, 2021; v1 submitted 16 November, 2020; originally announced November 2020.

Comments: 7 pages main paper with 11 pages supplementary material

Showing 1–3 of 3 results for author: Chang, J Q L