Showing 1–8 of 8 results for author: Talebi, M S

  1. arXiv:2506.00286  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

    Authors: Oliver Mortensen, Mohammad Sadegh Talebi

    Abstract: In this paper we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $π^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $β\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based…

    Submitted 30 May, 2025; originally announced June 2025.

  2. arXiv:2407.15662  [pdf, other]

    stat.ML cs.AI cs.LG eess.SY

    How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?

    Authors: Odalric-Ambrym Maillard, Mohammad Sadegh Talebi

    Abstract: We consider the situation when a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set thanks to $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common…

    Submitted 22 July, 2024; originally announced July 2024.

  3. arXiv:2009.04575  [pdf, other]

    cs.LG stat.ML

    Improved Exploration in Factored Average-Reward MDPs

    Authors: Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard

    Abstract: We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal X$ and the state-space $\mathcal S$ admit the respective factored forms of $\mathcal X = \otimes_{i=1}^n \mathcal X_i$ and $\mathcal S=\otimes_{i=1}^m \mathcal S_i$, and the transition and rewa…

    Submitted 11 March, 2021; v1 submitted 9 September, 2020; originally announced September 2020.

    Comments: 23 pages. To appear in Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021

  4. arXiv:2004.09656  [pdf, other]

    cs.LG eess.SY stat.ML

    Tightening Exploration in Upper Confidence Reinforcement Learning

    Authors: Hippolyte Bourel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi

    Abstract: The upper confidence reinforcement learning (UCRL2) algorithm introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this algorithm and its variants have remained until now mostly theoretical as numerical experiments in simple…

    Submitted 12 April, 2021; v1 submitted 20 April, 2020; originally announced April 2020.

    Comments: Appeared in Proceedings of the 27th International Conference on Machine Learning (ICML 2020). This is an improved post-proceeding version correcting minor errors

  5. arXiv:1910.04077  [pdf, other]

    cs.LG cs.AI stat.ML

    Model-Based Reinforcement Learning Exploiting State-Action Equivalence

    Authors: Mahsa Asadi, Mohammad Sadegh Talebi, Hippolyte Bourel, Odalric-Ambrym Maillard

    Abstract: Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which natura…

    Submitted 9 October, 2019; originally announced October 2019.

    Comments: ACML 2019. Recipient of the Best Student Paper Award

  6. arXiv:1905.11128  [pdf, ps, other]

    cs.LG stat.ML

    Learning Multiple Markov Chains via Adaptive Allocation

    Authors: Mohammad Sadegh Talebi, Odalric-Ambrym Maillard

    Abstract: We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain. We assume that the Markov chains are ergodic but otherwise unknown. The learner can sample Markov chains sequentially to observe their states. The goal of the learner is to sequentially select various chains to learn transition matrices uniformly well with respect…

    Submitted 13 November, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Accepted to NeurIPS 2019

  7. arXiv:1803.01626  [pdf, other]

    stat.ML cs.LG eess.SY

    Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

    Authors: Mohammad Sadegh Talebi, Odalric-Ambrym Maillard

    Abstract: The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the d…

    Submitted 5 March, 2018; originally announced March 2018.

    Comments: To appear in Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT 2018)

  8. arXiv:1502.03475  [pdf, other]

    cs.LG math.OC stat.ML

    Combinatorial Bandits Revisited

    Authors: Richard Combes, M. Sadegh Talebi, Alexandre Proutiere, Marc Lelarge

    Abstract: This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ES…

    Submitted 5 November, 2015; v1 submitted 11 February, 2015; originally announced February 2015.

    Comments: 30 pages, Advances in Neural Information Processing Systems 28 (NIPS 2015)