-
Refined PAC-Bayes Bounds for Offline Bandits
Authors:
Amaury Gouverneur,
Tobias J. Oechtering,
Mikael Skoglund
Abstract:
In this paper, we present refined probabilistic bounds on empirical reward estimates for off-policy learning in bandit problems. We build on the PAC-Bayesian bounds from Seldin et al. (2010) and improve on their results using a new parameter optimization approach introduced by Rodríguez et al. (2024). This technique is based on a discretization of the space of possible events to optimize the "in p…
▽ More
In this paper, we present refined probabilistic bounds on empirical reward estimates for off-policy learning in bandit problems. We build on the PAC-Bayesian bounds from Seldin et al. (2010) and improve on their results using a new parameter optimization approach introduced by Rodríguez et al. (2024). This technique is based on a discretization of the space of possible events to optimize the "in probability" parameter. We provide two parameter-free PAC-Bayes bounds, one based on Hoeffding-Azuma's inequality and the other based on Bernstein's inequality. We prove that our bounds are almost optimal as they recover the same rate as would be obtained by setting the "in probability" parameter after the realization of the data.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
An Information-Theoretic Analysis of Thompson Sampling with Infinite Action Spaces
Authors:
Amaury Gouverneur,
Borja Rodriguez Gálvez,
Tobias Oechtering,
Mikael Skoglund
Abstract:
This paper studies the Bayesian regret of the Thompson Sampling algorithm for bandit problems, building on the information-theoretic framework introduced by Russo and Van Roy (2015). Specifically, it extends the rate-distortion analysis of Dong and Van Roy (2018), which provides near-optimal bounds for linear bandits. A limitation of these results is the assumption of a finite action space. We add…
▽ More
This paper studies the Bayesian regret of the Thompson Sampling algorithm for bandit problems, building on the information-theoretic framework introduced by Russo and Van Roy (2015). Specifically, it extends the rate-distortion analysis of Dong and Van Roy (2018), which provides near-optimal bounds for linear bandits. A limitation of these results is the assumption of a finite action space. We address this by extending the analysis to settings with infinite and continuous action spaces. Additionally, we specialize our results to bandit problems with expected rewards that are Lipschitz continuous with respect to the action space, deriving a regret bound that explicitly accounts for the complexity of the action space.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits
Authors:
Amaury Gouverneur,
Borja Rodríguez-Gálvez,
Tobias J. Oechtering,
Mikael Skoglund
Abstract:
We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(β\langle a, θ\rangle)/(1+\exp(β\langle a, θ\rangle))$, with slope parameter $β>0$, and where both the action $a\in \mathcal{A}$ and parameter $θ\in \mathcal{O}$ lie within the $d$-dimensional unit bal…
▽ More
We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(β\langle a, θ\rangle)/(1+\exp(β\langle a, θ\rangle))$, with slope parameter $β>0$, and where both the action $a\in \mathcal{A}$ and parameter $θ\in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo and Van Roy (2016), we analyze the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred and the information gained about the optimal action. We improve upon previous results by establishing that the information ratio is bounded by $\tfrac{9}{2}dα^{-2}$, where $α$ is a minimax measure of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$, and is independent of $β$. Using this result, we derive a bound of order $O(d/α\sqrt{T \log(βT/d)})$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps. To our knowledge, this is the first regret bound for logistic bandits that depends only logarithmically on $β$ while being independent of the number of actions. In particular, when the action space contains the parameter space, the bound on the expected regret is of order $\tilde{O}(d \sqrt{T})$.
△ Less
Submitted 20 February, 2025; v1 submitted 3 December, 2024;
originally announced December 2024.
-
Information-Theoretic Minimax Regret Bounds for Reinforcement Learning based on Duality
Authors:
Raghav Bongole,
Amaury Gouverneur,
Borja Rodríguez-Gálvez,
Tobias J. Oechtering,
Mikael Skoglund
Abstract:
We study agents acting in an unknown environment where the agent's goal is to find a robust policy. We consider robust policies as policies that achieve high cumulative rewards for all possible environments. To this end, we consider agents minimizing the maximum regret over different environment parameters, leading to the study of minimax regret. This research focuses on deriving information-theor…
▽ More
We study agents acting in an unknown environment where the agent's goal is to find a robust policy. We consider robust policies as policies that achieve high cumulative rewards for all possible environments. To this end, we consider agents minimizing the maximum regret over different environment parameters, leading to the study of minimax regret. This research focuses on deriving information-theoretic bounds for minimax regret in Markov Decision Processes (MDPs) with a finite time horizon. Building on concepts from supervised learning, such as minimum excess risk (MER) and minimax excess risk, we use recent bounds on the Bayesian regret to derive minimax regret bounds. Specifically, we establish minimax theorems and use bounds on the Bayesian regret to perform minimax regret analysis using these minimax theorems. Our contributions include defining a suitable minimax regret in the context of MDPs, finding information-theoretic bounds for it, and applying these bounds in various scenarios.
△ Less
Submitted 21 October, 2024;
originally announced October 2024.
-
Chained Information-Theoretic bounds and Tight Regret Rate for Linear Bandit Problems
Authors:
Amaury Gouverneur,
Borja Rodríguez-Gálvez,
Tobias J. Oechtering,
Mikael Skoglund
Abstract:
This paper studies the Bayesian regret of a variant of the Thompson-Sampling algorithm for bandit problems. It builds upon the information-theoretic framework of [Russo and Van Roy, 2015] and, more specifically, on the rate-distortion analysis from [Dong and Van Roy, 2020], where they proved a bound with regret rate of $O(d\sqrt{T \log(T)})$ for the $d$-dimensional linear bandit setting. We focus…
▽ More
This paper studies the Bayesian regret of a variant of the Thompson-Sampling algorithm for bandit problems. It builds upon the information-theoretic framework of [Russo and Van Roy, 2015] and, more specifically, on the rate-distortion analysis from [Dong and Van Roy, 2020], where they proved a bound with regret rate of $O(d\sqrt{T \log(T)})$ for the $d$-dimensional linear bandit setting. We focus on bandit problems with a metric action space and, using a chaining argument, we establish new bounds that depend on the metric entropy of the action space for a variant of Thompson-Sampling.
Under suitable continuity assumption of the rewards, our bound offers a tight rate of $O(d\sqrt{T})$ for $d$-dimensional linear bandit problems.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Thompson Sampling Regret Bounds for Contextual Bandits with sub-Gaussian rewards
Authors:
Amaury Gouverneur,
Borja Rodríguez-Gálvez,
Tobias J. Oechtering,
Mikael Skoglund
Abstract:
In this work, we study the performance of the Thompson Sampling algorithm for Contextual Bandit problems based on the framework introduced by Neu et al. and their concept of lifted information ratio. First, we prove a comprehensive bound on the Thompson Sampling expected cumulative regret that depends on the mutual information of the environment parameters and the history. Then, we introduce new b…
▽ More
In this work, we study the performance of the Thompson Sampling algorithm for Contextual Bandit problems based on the framework introduced by Neu et al. and their concept of lifted information ratio. First, we prove a comprehensive bound on the Thompson Sampling expected cumulative regret that depends on the mutual information of the environment parameters and the history. Then, we introduce new bounds on the lifted information ratio that hold for sub-Gaussian rewards, thus generalizing the results from Neu et al. which analysis requires binary rewards. Finally, we provide explicit regret bounds for the special cases of unstructured bounded contextual bandits, structured bounded contextual bandits with Laplace likelihood, structured Bernoulli bandits, and bounded linear contextual bandits.
△ Less
Submitted 26 April, 2023;
originally announced April 2023.
-
An Information-Theoretic Analysis of Bayesian Reinforcement Learning
Authors:
Amaury Gouverneur,
Borja Rodríguez-Gálvez,
Tobias J. Oechtering,
Mikael Skoglund
Abstract:
Building on the framework introduced by Xu and Raginksy [1] for supervised learning problems, we study the best achievable performance for model-based Bayesian reinforcement learning problems. With this purpose, we define minimum Bayesian regret (MBR) as the difference between the maximum expected cumulative reward obtainable either by learning from the collected data or by knowing the environment…
▽ More
Building on the framework introduced by Xu and Raginksy [1] for supervised learning problems, we study the best achievable performance for model-based Bayesian reinforcement learning problems. With this purpose, we define minimum Bayesian regret (MBR) as the difference between the maximum expected cumulative reward obtainable either by learning from the collected data or by knowing the environment and its dynamics. We specialize this definition to reinforcement learning problems modeled as Markov decision processes (MDPs) whose kernel parameters are unknown to the agent and whose uncertainty is expressed by a prior distribution. One method for deriving upper bounds on the MBR is presented and specific bounds based on the relative entropy and the Wasserstein distance are given. We then focus on two particular cases of MDPs, the multi-armed bandit problem (MAB) and the online optimization with partial feedback problem. For the latter problem, we show that our bounds can recover from below the current information-theoretic bounds by Russo and Van Roy [2].
△ Less
Submitted 18 July, 2022;
originally announced July 2022.
-
Optimal Intermittent Particle Filter
Authors:
Antoine Aspeel,
Amaury Gouverneur,
Raphaël M. Jungers,
Benoit Macq
Abstract:
The problem of the optimal allocation (in the expected mean square error sense) of a measurement budget for particle filtering is addressed. We propose three different optimal intermittent filters, whose optimality criteria depend on the information available at the time of decision making. For the first, the stochastic program filter, the measurement times are given by a policy that determines wh…
▽ More
The problem of the optimal allocation (in the expected mean square error sense) of a measurement budget for particle filtering is addressed. We propose three different optimal intermittent filters, whose optimality criteria depend on the information available at the time of decision making. For the first, the stochastic program filter, the measurement times are given by a policy that determines whether a measurement should be taken based on the measurements already acquired. The second, called the offline filter, determines all measurement times at once by solving a combinatorial optimization program before any measurement acquisition. For the third one, which we call online filter, each time a new measurement is received, the next measurement time is recomputed to take all the information that is then available into account. We prove that in terms of expected mean square error, the stochastic program filter outperforms the online filter, which itself outperforms the offline filter. However, these filters are generally intractable. For this reason, the filter estimate is approximated by a particle filter. Moreover, the mean square error is approximated using a Monte-Carlo approach, and different optimization algorithms are compared to approximately solve the combinatorial programs (a random trial algorithm, greedy forward and backward algorithms, a simulated annealing algorithm, and a genetic algorithm). Finally, the performance of the proposed methods is illustrated on two examples: a tumor motion model and a common benchmark for particle filtering.
△ Less
Submitted 13 April, 2022;
originally announced April 2022.
-
Optimal measurement budget allocation for particle filtering
Authors:
Antoine Aspeel,
Amaury Gouverneur,
Raphaël M. Jungers,
Benoît Macq
Abstract:
Particle filtering is a powerful tool for target tracking. When the budget for observations is restricted, it is necessary to reduce the measurements to a limited amount of samples carefully selected. A discrete stochastic nonlinear dynamical system is studied over a finite time horizon. The problem of selecting the optimal measurement times for particle filtering is formalized as a combinatorial…
▽ More
Particle filtering is a powerful tool for target tracking. When the budget for observations is restricted, it is necessary to reduce the measurements to a limited amount of samples carefully selected. A discrete stochastic nonlinear dynamical system is studied over a finite time horizon. The problem of selecting the optimal measurement times for particle filtering is formalized as a combinatorial optimization problem. We propose an approximated solution based on the nesting of a genetic algorithm, a Monte Carlo algorithm and a particle filter. Firstly, an example demonstrates that the genetic algorithm outperforms a random trial optimization. Then, the interest of non-regular measurements versus measurements performed at regular time intervals is illustrated and the efficiency of our proposed solution is quantified: better filtering performances are obtained in 87.5% of the cases and on average, the relative improvement is 27.7%.
△ Less
Submitted 18 May, 2020;
originally announced May 2020.