Search | arXiv e-print repository

Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Authors: Oliver Mortensen, Mohammad Sadegh Talebi

Abstract: In this paper we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $π^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $β\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based… ▽ More In this paper we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $π^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $β\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI) which leads to $(ε,δ)$-PAC-bounds on $\|Q^*-Q^k\|$, and $\|V^*-V^{π_k}\|$ where $Q_k$ is the output of MB-RS-QVI after k iterations and $π_k$ is the greedy policy with respect to $Q_k$. Both PAC-bounds have exponential dependence on the effective horizon $\frac{1}{1-γ}$ and the strength of this dependence grows with the learners risk-sensitivity $|β|$. We also provide two lower bounds which shows that exponential dependence on $|β|\frac{1}{1-γ}$ is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are both tight in $\varepsilon$ and $δ$ and that the PAC-bound on $Q$-learning is tight in the number of actions $A$, and that the PAC-bound on policy-learning is nearly tight in $A$. △ Less

Submitted 30 May, 2025; originally announced June 2025.

arXiv:2407.15662 [pdf, other]

How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?

Authors: Odalric-Ambrym Maillard, Mohammad Sadegh Talebi

Abstract: We consider the situation when a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set thanks to $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common… ▽ More We consider the situation when a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set thanks to $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common, but unknown, distribution q via applying an unknown permutation to $\mathcal X$. We call this \emph{permutation-equivalence}. The goal is to build refined confidence sets \emph{exploiting} this structural property. Like other popular notions of structure (Lipschitz smoothness, Linearity, etc.) permutation-equivalence naturally appears in machine learning problems, and to benefit from its potential gain calls for a specific approach. We present a strategy to effectively exploit permutation-equivalence, and provide a finite-time high-probability bound on the size of the refined confidence sets output by the strategy. Since a refinement is not possible for too few observations in general, under mild technical assumptions, our finite-time analysis establish when the number of observations $(n_k)_{k\in \mathcal K}$ are large enough so that the output confidence sets improve over initial individual sets. We carefully characterize this event and the corresponding improvement. Further, our result implies that the size of confidence sets shrink at asymptotic rates of $O(1/\sqrt{\sum_{k\in \mathcal K} n_k})$ and $O(1/\max_{k\in K} n_{k})$, respectively for elements inside and outside the support of q, when the size of each individual confidence set shrinks at respective rates of $O(1/\sqrt{n_k})$ and $O(1/n_k)$. We illustrate the practical benefit of exploiting permutation equivalence on a reinforcement learning task. △ Less

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2009.04575 [pdf, other]

Improved Exploration in Factored Average-Reward MDPs

Authors: Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard

Abstract: We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal X$ and the state-space $\mathcal S$ admit the respective factored forms of $\mathcal X = \otimes_{i=1}^n \mathcal X_i$ and $\mathcal S=\otimes_{i=1}^m \mathcal S_i$, and the transition and rewa… ▽ More We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal X$ and the state-space $\mathcal S$ admit the respective factored forms of $\mathcal X = \otimes_{i=1}^n \mathcal X_i$ and $\mathcal S=\otimes_{i=1}^m \mathcal S_i$, and the transition and reward functions are factored over $\mathcal X$ and $\mathcal S$. Assuming known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound, whose leading term strictly improves over existing regret bounds in terms of the dependencies on the size of $\mathcal S_i$'s and the involved diameter-related terms. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of regret of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL enjoys substantially improved regret empirically over existing algorithms that have frequentist regret guarantees. △ Less

Submitted 11 March, 2021; v1 submitted 9 September, 2020; originally announced September 2020.

Comments: 23 pages. To appear in Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021

arXiv:2004.09656 [pdf, other]

Tightening Exploration in Upper Confidence Reinforcement Learning

Authors: Hippolyte Bourel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi

Abstract: The upper confidence reinforcement learning (UCRL2) algorithm introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this algorithm and its variants have remained until now mostly theoretical as numerical experiments in simple… ▽ More The upper confidence reinforcement learning (UCRL2) algorithm introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this algorithm and its variants have remained until now mostly theoretical as numerical experiments in simple environments exhibit long burn-in phases before the learning takes place. In pursuit of practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair. Furthermore, to tighten exploration, it uses an adaptive computation of the support of each transition distribution, which in turn enables us to revisit the extended value iteration procedure of UCRL2 to optimize over distributions with reduced support by disregarding low probability transitions, while still ensuring near-optimism. We demonstrate, through numerical experiments in standard environments, that reducing exploration this way yields a substantial numerical improvement compared to UCRL2 and its variants. On the theoretical side, these key modifications enable us to derive a regret bound for UCRL3 improving on UCRL2, that for the first time makes appear notions of local diameter and local effective support, thanks to variance-aware concentration bounds. △ Less

Submitted 12 April, 2021; v1 submitted 20 April, 2020; originally announced April 2020.

Comments: Appeared in Proceedings of the 27th International Conference on Machine Learning (ICML 2020). This is an improved post-proceeding version correcting minor errors

arXiv:1910.04077 [pdf, other]

Model-Based Reinforcement Learning Exploiting State-Action Equivalence

Authors: Mahsa Asadi, Mohammad Sadegh Talebi, Hippolyte Bourel, Odalric-Ambrym Maillard

Abstract: Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which natura… ▽ More Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, as a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of regret by a factor of $\sqrt{SA/C}$, in any communicating MDP with $S$ states, $A$ actions, and $C$ classes, which corresponds to a massive improvement when $C \ll SA$. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs. △ Less

Submitted 9 October, 2019; originally announced October 2019.

Comments: ACML 2019. Recipient of the Best Student Paper Award

arXiv:1905.11128 [pdf, ps, other]

Learning Multiple Markov Chains via Adaptive Allocation

Authors: Mohammad Sadegh Talebi, Odalric-Ambrym Maillard

Abstract: We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain. We assume that the Markov chains are ergodic but otherwise unknown. The learner can sample Markov chains sequentially to observe their states. The goal of the learner is to sequentially select various chains to learn transition matrices uniformly well with respect… ▽ More We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain. We assume that the Markov chains are ergodic but otherwise unknown. The learner can sample Markov chains sequentially to observe their states. The goal of the learner is to sequentially select various chains to learn transition matrices uniformly well with respect to some loss function. We introduce a notion of loss that naturally extends the squared loss for learning distributions to the case of Markov chains, and further characterize the notion of being \emph{uniformly good} in all problem instances. We present a novel learning algorithm that efficiently balances \emph{exploration} and \emph{exploitation} intrinsic to this problem, without any prior knowledge of the chains. We provide finite-sample PAC-type guarantees on the performance of the algorithm. Further, we show that our algorithm asymptotically attains an optimal loss. △ Less

Submitted 13 November, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: Accepted to NeurIPS 2019

arXiv:1803.01626 [pdf, other]

Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

Authors: Mohammad Sadegh Talebi, Odalric-Ambrym Maillard

Abstract: The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the d… ▽ More The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-UCRL algorithm establishing a high-probability regret bound scaling as $\widetilde {\mathcal O}\Bigl({\textstyle \sqrt{S\sum_{s,a}{\bf V}^\star_{s,a}T}}\Big)$ for this algorithm for ergodic MDPs, where $S$ denotes the number of states and where ${\bf V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best previously known regret bound $\widetilde {\mathcal O}(DS\sqrt{AT})$ for that algorithm, where $A$ and $D$ respectively denote the maximum number of actions (per state) and the diameter of MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, that we believe to be of independent interest. △ Less

Submitted 5 March, 2018; originally announced March 2018.

Comments: To appear in Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT 2018)

arXiv:1502.03475 [pdf, other]

Combinatorial Bandits Revisited

Authors: Richard Combes, M. Sadegh Talebi, Alexandre Proutiere, Marc Lelarge

Abstract: This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ES… ▽ More This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the adversarial setting under bandit feedback, we propose \textsc{CombEXP}, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems. △ Less

Submitted 5 November, 2015; v1 submitted 11 February, 2015; originally announced February 2015.

Comments: 30 pages, Advances in Neural Information Processing Systems 28 (NIPS 2015)

Showing 1–8 of 8 results for author: Talebi, M S