Bayesian bandits: balancing the exploration-exploitation tradeoff via double sampling

Urteaga, Iñigo; Wiggins, Chris H.

arXiv:1709.03162 (stat)

[Submitted on 10 Sep 2017 (v1), last revised 8 Aug 2018 (this version, v2)]

Abstract: Reinforcement learning studies how to balance exploration and exploitation in real-world systems, optimizing interactions with the world while simultaneously learning how the world operates. One general class of algorithms for such learning is the multi-armed bandit setting. Randomized probability matching, based upon the Thompson sampling approach introduced in the 1930s, has recently been shown to perform well and to enjoy provable optimality properties. It permits generative, interpretable modeling in a Bayesian setting, where prior knowledge is incorporated, and the computed posteriors naturally capture the full state of knowledge. In this work, we harness the information contained in the Bayesian posterior and estimate its sufficient statistics via sampling. In several application domains, for example in health and medicine, each interaction with the world can be expensive and invasive, whereas drawing samples from the model is relatively inexpensive. Exploiting this viewpoint, we develop a double sampling technique driven by the uncertainty in the learning process: it favors exploitation when certain about the properties of each arm, exploring otherwise. The proposed algorithm does not make any distributional assumption and it is applicable to complex reward distributions, as long as Bayesian posterior updates are computable. Utilizing the estimated posterior sufficient statistics, double sampling autonomously balances the exploration-exploitation tradeoff to make better informed decisions. We empirically show its reduced cumulative regret when compared to state-of-the-art alternatives in representative bandit settings.
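
For orientation, the baseline the abstract builds on (Thompson sampling) can be sketched for the simplest case of Bernoulli rewards with Beta posteriors. This is an illustrative sketch only, not the authors' double sampling algorithm, whose details the abstract does not give. The function names (`thompson_step`, `prob_optimal`, `run_bandit`), the uniform Beta(1, 1) prior, and the Bernoulli reward model are all assumptions made for this example. `prob_optimal` illustrates the general idea of drawing many inexpensive samples from the posterior to estimate how certain the model is about which arm is best.

```python
import random


def thompson_step(successes, failures, rng):
    """Draw one sample per arm from its Beta posterior and pick the largest."""
    # Assumed prior: Beta(1, 1), so the posterior is Beta(s + 1, f + 1).
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def prob_optimal(successes, failures, rng, n_samples=1000):
    """Monte Carlo estimate of the posterior probability that each arm is best."""
    counts = [0] * len(successes)
    for _ in range(n_samples):
        counts[thompson_step(successes, failures, rng)] += 1
    return [c / n_samples for c in counts]


def run_bandit(true_means, horizon, seed=0):
    """Simulate Thompson sampling on a Bernoulli bandit and track regret."""
    rng = random.Random(seed)
    k = len(true_means)
    successes, failures = [0] * k, [0] * k
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        arm = thompson_step(successes, failures, rng)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        # Expected regret of this pull: gap to the best arm's true mean.
        regret += best - true_means[arm]
    return successes, failures, regret


if __name__ == "__main__":
    s, f, regret = run_bandit([0.2, 0.8], horizon=500)
    print("pulls per arm:", [a + b for a, b in zip(s, f)])
    print("cumulative regret:", regret)
    print("P(arm is best):", prob_optimal(s, f, random.Random(1)))
```

In this sketch, only the calls inside `run_bandit` that observe a reward correspond to costly interactions with the world; the posterior draws in `prob_optimal` are model-side computation and can be repeated freely.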

Comments:	The software used for this study is publicly available at this https URL
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
MSC classes:	I.2.6
Cite as:	arXiv:1709.03162 [stat.ML]
	(or arXiv:1709.03162v2 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1709.03162

Submission history

From: Iñigo Urteaga
[v1] Sun, 10 Sep 2017 19:58:34 UTC (249 KB)
[v2] Wed, 8 Aug 2018 20:20:27 UTC (788 KB)
