On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes

Abstract: A recent theoretical analysis of a Monte-Carlo tree search (MCTS) method properly modified from the "upper confidence bound applied to trees" (UCT) algorithm established a surprising result, due to a great deal of empirical successes reported from heuristic usage of UCT with relevant adjustments for various problem domains in the literature, that its rate of convergence of the expected absolute error to zero is $O(1/\sqrt{n})$ in estimating the optimal value at an initial state in a finite-horizon Markov decision process (MDP), where $n$ is the number of simulations. We strengthen this dispiriting slow convergence result by arguing within a simpler algorithmic framework in the perspective of MDP, apart from the usual MCTS description, that the simpler strategy, called "upper confidence bound 1" (UCB1) for multi-armed bandit problems, when employed as an instance of MCTS by setting UCB1's arm set to be the policy set of the underlying MDP, has an asymptotically faster convergence-rate of $O(\ln n / n)$. We also point out that the UCT-based MCTS in general has the time and space complexities that depend on the size of the state space in the worst case, which contradicts the original design spirit of MCTS. Unless heuristically used, UCT-based MCTS has yet to have theoretical supports for its applicabilities.

Submitted 1 February, 2025; v1 submitted 10 February, 2024; originally announced February 2024.
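
For readers unfamiliar with the UCB1 strategy named in the abstract above, the following is a minimal sketch of the standard UCB1 index rule on a plain multi-armed bandit. It is not the paper's algorithm: the Bernoulli arms, the function name `ucb1`, and the arm means are hypothetical stand-ins. In the paper's setting each arm would be a policy of the MDP and each reward a simulated return.

```python
import math
import random


def ucb1(arm_means, n_rounds, seed=0):
    """Run standard UCB1 on Bernoulli arms (hypothetical stand-ins for policies).

    Returns the pull count and the empirical mean reward of each arm.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # number of times each arm was pulled
    sums = [0.0] * k   # total reward collected from each arm

    for t in range(1, n_rounds + 1):
        if t <= k:
            # Initialization: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm maximizing empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        # Simulated Bernoulli reward; an assumption made for illustration only.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    estimates = [sums[i] / counts[i] for i in range(k)]
    return counts, estimates


counts, estimates = ucb1([0.2, 0.5, 0.8], n_rounds=5000)
```

With well-separated arm means as in this toy setup, the pulls concentrate on the best arm, so its empirical mean serves as the estimate of the optimal value.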

arXiv:2401.08845

Top Feasible-Arm Subset Identification in Constrained Multi-Armed Bandit with Limited Budget

Authors: Hyeong Soo Chang

Abstract: We present an algorithm, "constrained successive accept or reject (CSAR)," for the problem of identifying the subset of top feasible-arms from a given finite set of arms with the limited sampling-budget equal to a given time-horizon when the sequential dynamics of the arms follows the model of a constrained multi-armed bandit. We provide a finite-time upper bound on the probability of the incorrect identification by CSAR that converges to zero with an exponential rate in the sampling-budget.

Submitted 21 January, 2025; v1 submitted 16 January, 2024; originally announced January 2024.

arXiv:2308.03297

Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality

Authors: Hyeong Soo Chang

Abstract: We consider a dynamic programming (DP) approach to approximately solving an infinite-horizon constrained Markov decision process (CMDP) problem with a fixed initial-state for the expected total discounted-reward criterion with a uniform-feasibility constraint of the expected total discounted-cost in a deterministic, history-independent, and stationary policy set. We derive a DP-equation that recursively holds for a CMDP problem and its sub-CMDP problems, where each problem, induced from the parameters of the original CMDP problem, admits a uniformly-optimal feasible policy in its policy set associated with the inputs to the problem. A policy constructed from the DP-equation is shown to achieve the optimal values, defined for the CMDP problem the policy is a solution to, at all states. Based on the result, we discuss off-line and on-line computational algorithms, motivated from policy iteration for MDPs, whose output sequences have local convergences for the original CMDP problem.

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2206.01860

On Supervised On-line Rolling-Horizon Control for Infinite-Horizon Discounted Markov Decision Processes

Authors: Hyeong Soo Chang

Abstract: This note re-visits the rolling-horizon control approach to the problem of a Markov decision process (MDP) with infinite-horizon discounted expected reward criterion. Distinguished from the classical value-iteration approach, we develop an asynchronous on-line algorithm based on policy iteration integrated with a multi-policy improvement method of policy switching. A sequence of monotonically improving solutions to the forecast-horizon sub-MDP is generated by updating the current solution only at the currently visited state, building in effect a rolling-horizon control policy for the MDP over infinite horizon. Feedbacks from "supervisors," if available, can be also incorporated while updating. We focus on the convergence issue with a relation to the transition structure of the MDP. Either a global convergence to an optimal forecast-horizon policy or a local convergence to a "locally-optimal" fixed-policy in a finite time is achieved by the algorithm depending on the structure.

Submitted 3 June, 2022; originally announced June 2022.

arXiv:2112.02177

On-line Policy Iteration with Policy Switching for Markov Decision Processes

Authors: Hyeong Soo Chang

Abstract: Motivated from Bertsekas' recent study on policy iteration (PI) for solving the problems of infinite-horizon discounted Markov decision processes (MDPs) in an on-line setting, we develop an off-line PI integrated with a multi-policy improvement method of policy switching and then adapt its asynchronous variant into on-line PI algorithm that generates a sequence of policies over time. The current policy is updated into the next policy by switching the action only at the current state while ensuring the monotonicity of the value functions of the policies in the sequence. Depending on MDP's state-transition structure, the sequence converges in a finite time to an optimal policy for an associated local MDP. When MDP is communicating, the sequence converges to an optimal policy for the original MDP.

Submitted 3 December, 2021; originally announced December 2021.
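
The policy-switching step named in the abstract above can be illustrated on a toy problem. The sketch below shows only the generic multi-policy improvement idea: at each state, follow whichever base policy has the larger value there. It is not the paper's on-line PI algorithm. The two-state MDP (`P`, `R`), the discount factor, and the helper names `evaluate` and `policy_switching` are all hypothetical.

```python
def evaluate(policy, P, R, gamma, sweeps=2000):
    """Approximate the discounted value function of a stationary policy
    by repeated application of its Bellman operator."""
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [
            R[s][policy[s]]
            + gamma * sum(p * v[t] for t, p in enumerate(P[s][policy[s]]))
            for s in range(n)
        ]
    return v


def policy_switching(policies, values):
    """At each state, copy the action of the base policy whose value is largest there."""
    n = len(policies[0])
    switched = []
    for s in range(n):
        best = max(range(len(policies)), key=lambda i: values[i][s])
        switched.append(policies[best][s])
    return switched


# Hypothetical toy MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [
    [[1.0, 0.0], [1.0, 0.0]],  # state 0: both actions stay in state 0
    [[1.0, 0.0], [0.0, 1.0]],  # state 1: action 0 moves to state 0, action 1 stays
]
R = [
    [1.0, 0.0],
    [0.0, 2.0],
]
gamma = 0.9

base_policies = [[0, 0], [1, 1]]
base_values = [evaluate(pi, P, R, gamma) for pi in base_policies]
switched = policy_switching(base_policies, base_values)
switched_value = evaluate(switched, P, R, gamma)
```

In this toy instance the switched policy is at least as good as each base policy at every state, which is the monotonicity property the abstract relies on.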

arXiv:2007.14550

An Index-based Deterministic Asymptotically Optimal Algorithm for Constrained Multi-armed Bandit Problems

Authors: Hyeong Soo Chang

Abstract: For the model of constrained multi-armed bandit, we show that by construction there exists an index-based deterministic asymptotically optimal algorithm. The optimality is achieved by the convergence of the probability of choosing an optimal feasible arm to one over infinite horizon. The algorithm is built upon Locatelli et al.'s "anytime parameter-free thresholding" algorithm under the assumption that the optimal value is known. We provide a finite-time bound to the probability of the asymptotic optimality given as 1-O(|A|Te^{-T}) where T is the horizon size and A is the set of the arms in the bandit. We then study a relaxed-version of the algorithm in a general form that estimates the optimal value and discuss the asymptotic optimality of the algorithm after a sufficiently large T with examples.

Submitted 28 July, 2020; originally announced July 2020.

arXiv:1805.01237

An Asymptotically Optimal Strategy for Constrained Multi-armed Bandit Problems

Authors: Hyeong Soo Chang

Abstract: For the stochastic multi-armed bandit (MAB) problem from a constrained model that generalizes the classical one, we show that an asymptotic optimality is achievable by a simple strategy extended from the $ε_t$-greedy strategy. We provide a finite-time lower bound on the probability of correct selection of an optimal near-feasible arm that holds for all time steps. Under some conditions, the bound approaches one as time $t$ goes to infinity. A particular example sequence of $\{ε_t\}$ having the asymptotic convergence rate in the order of $(1-\frac{1}{t})^4$ that holds from a sufficiently large $t$ is also discussed.

Submitted 3 May, 2018; originally announced May 2018.

arXiv:1412.4898

Sleeping Experts and Bandits Approach to Constrained Markov Decision Processes

Authors: Hyeong Soo Chang

Abstract: This brief paper presents simple simulation-based algorithms for obtaining an approximately optimal policy in a given finite set in large finite constrained Markov decision processes. The algorithms are adapted from playing strategies for "sleeping experts and bandits" problem and their computational complexities are independent of state and action space sizes if the given policy set is relatively small. We establish convergence of their expected performances to the value of an optimal policy and convergence rates, and also almost-sure convergence to an optimal policy with an exponential rate for the algorithm adapted within the context of sleeping experts.

Submitted 16 December, 2014; originally announced December 2014.

Showing 1–8 of 8 results for author: Chang, H S