-
Near-Optimal Clustering in Mixture of Markov Chains
Authors:
Junghyun Lee,
Yassir Jedra,
Alexandre Proutière,
Se-Young Yun
Abstract:
We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$. The goal is to accurately group trajectories according to their underlying generative model. We begin by deriving an instance-dependent, high-probability lower bound on the clustering error rate, governed by the weighted KL divergence between the transition kernels of the chains. We then present a novel two-stage clustering algorithm. In Stage I, we apply spectral clustering using a new injective Euclidean embedding for ergodic Markov chains -- a contribution of independent interest that enables sharp concentration results. Stage II refines the initial clusters via a single step of likelihood-based reassignment. Our method achieves a near-optimal clustering error with high probability, under the conditions $H = \tilde{Ω}(γ_{\mathrm{ps}}^{-1} (S^2 \vee π_{\min}^{-1}))$ and $TH = \tilde{Ω}(γ_{\mathrm{ps}}^{-1} S^2)$, where $π_{\min}$ is the minimum stationary probability of a state across the $K$ chains and $γ_{\mathrm{ps}}$ is the minimum pseudo-spectral gap. These requirements significantly improve on, or are at least comparable to, the state-of-the-art guarantee (Kausik et al., 2023). Moreover, our algorithm offers a key practical advantage: unlike existing approaches, it requires no prior knowledge of model-specific quantities (e.g., separation between kernels or visitation probabilities). We conclude by discussing the inherent gap between our upper and lower bounds, providing insights into the unique structure of this clustering problem.
Submitted 2 June, 2025;
originally announced June 2025.
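The likelihood-based reassignment step mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names (`estimate_kernel`, `log_likelihood`, `reassign_once`), the additive smoothing constant, and the use of plain empirical transition counts are all assumptions made for illustration, and the Stage I spectral embedding is not reproduced. States are assumed to be integers in `range(num_states)`.

```python
import math

def estimate_kernel(trajectories, num_states, smoothing=1.0):
    """Empirical transition matrix with additive smoothing (smoothing is an assumption)."""
    counts = [[smoothing] * num_states for _ in range(num_states)]
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            counts[s][s_next] += 1.0
    return [[c / sum(row) for c in row] for row in counts]

def log_likelihood(traj, kernel):
    """Log-likelihood of the observed transitions of one trajectory under a kernel."""
    return sum(math.log(kernel[s][s_next]) for s, s_next in zip(traj, traj[1:]))

def reassign_once(trajectories, labels, num_clusters, num_states):
    """One reassignment pass: fit a kernel per cluster, then move each
    trajectory to the cluster whose kernel gives it the highest likelihood."""
    kernels = []
    for k in range(num_clusters):
        members = [t for t, lab in zip(trajectories, labels) if lab == k]
        kernels.append(estimate_kernel(members, num_states))
    return [
        max(range(num_clusters), key=lambda k: log_likelihood(t, kernels[k]))
        for t in trajectories
    ]
```

Given an initial labeling with a few mistakes, a single call to `reassign_once` returns updated labels; whether one pass suffices in general depends on conditions such as those stated in the abstract.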
-
Policy Testing in Markov Decision Processes
Authors:
Kaito Ariu,
Po-An Wang,
Alexandre Proutiere,
Kenshi Abe
Abstract:
We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.
Submitted 21 May, 2025;
originally announced May 2025.
-
Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation
Authors:
Stefan Stojanovic,
Yassir Jedra,
Alexandre Proutiere
Abstract:
We consider the problem of learning an $\varepsilon$-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an $\varepsilon$-optimal policy using $\widetilde{O}({S+A\over \mathrm{poly}(1-γ)\varepsilon^2})$ samples, where $S$ (resp. $A$) denotes the number of states (resp. actions) and $γ$ the discount factor. Our algorithm achieves this order-optimal (in $S$, $A$ and $\varepsilon$) sample complexity under milder conditions than those assumed in previously proposed approaches.
Submitted 10 November, 2024; v1 submitted 30 October, 2024;
originally announced October 2024.
-
Conformal Predictions under Markovian Data
Authors:
Frédéric Zheng,
Alexandre Proutiere
Abstract:
We study the split Conformal Prediction method when applied to Markovian data. We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data). This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as $\sqrt{t_\mathrm{mix}\ln(n)/n}$ (where $t_\mathrm{mix}$ is the mixing time of the chain). We also derive upper bounds on the impact of the correlations on the size of the prediction set. We then present $K$-split CP, a method that thins the calibration dataset and adapts to the mixing properties of the chain. Its coverage gap is reduced to $t_\mathrm{mix}/(n\ln(n))$ without significantly affecting the size of the prediction set. Finally, we test our algorithms on synthetic and real-world datasets.
Submitted 21 July, 2024;
originally announced July 2024.
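As a rough illustration of the two ingredients named in the abstract above, the sketch below computes the standard split conformal threshold from calibration scores and thins a time-ordered calibration sequence. It is a sketch under assumptions, not the paper's $K$-split CP procedure: the helper names `conformal_quantile` and `thin`, and the choice of keeping every `gap`-th score, are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Standard split-CP threshold: the ceil((n + 1) * (1 - alpha))-th
    smallest calibration score, or infinity if that rank exceeds n."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

def thin(scores, gap):
    """Keep every `gap`-th score of a time-ordered sequence, so the
    retained calibration points are further apart in time (an assumed
    form of thinning, chosen for illustration)."""
    return scores[::gap]
```

A prediction set for a new point would then contain every candidate label whose nonconformity score is at most `conformal_quantile(thin(scores, gap), alpha)`. How the gap should relate to the mixing time is the subject of the paper and is not specified here.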
-
Model-Free Active Exploration in Reinforcement Learning
Authors:
Alessio Russo,
Alexandre Proutiere
Abstract:
We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound of the number of samples that have to be collected to identify a nearly-optimal policy. Deriving this lower bound along with the optimal exploration strategy entails solving an intricate optimization problem and requires a model of the system. In turn, most existing sample-optimal exploration algorithms rely on estimating the model. We derive an approximation of the instance-specific lower bound that only involves quantities that can be inferred using model-free approaches. Leveraging this approximation, we devise an ensemble-based model-free exploration strategy applicable to both tabular and continuous Markov decision processes. Numerical results demonstrate that our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.
Submitted 30 June, 2024;
originally announced July 2024.
-
Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery
Authors:
Yassir Jedra,
William Réveillard,
Stefan Stojanovic,
Alexandre Proutiere
Abstract:
We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair $(i,j)\in [m]\times [n]$ is selected, the learner observes a noisy sample of the $(i,j)$-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an $\varepsilon$-optimal policy with probability at least $1-δ$ typically scales as ${r(m+n)\over \varepsilon^2}\log(1/δ)$. Our regret minimization algorithm enjoys minimax guarantees typically scaling as $r^{7/4}(m+n)^{3/4}\sqrt{T}$, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly $r(m+n)$ and misspecification controlled by the subspace recovery error, as well as to design the second phase of our algorithms efficiently.
Submitted 4 July, 2024; v1 submitted 24 February, 2024;
originally announced February 2024.
-
Best Arm Identification with Fixed Budget: A Large Deviation Perspective
Authors:
Po-An Wang,
Ruo-Chun Tzeng,
Alexandre Proutiere
Abstract:
We consider the problem of identifying the best arm in stochastic Multi-Armed Bandits (MABs) using a fixed sampling budget. Characterizing the minimal instance-specific error probability for this problem constitutes one of the important remaining open problems in MABs. When arms are selected using a static sampling strategy, the error probability decays exponentially with the number of samples at a rate that can be explicitly derived via Large Deviation techniques. Analyzing the performance of algorithms with adaptive sampling strategies is, however, much more challenging. In this paper, we establish a connection between the Large Deviation Principle (LDP) satisfied by the empirical proportions of arm draws and that satisfied by the empirical arm rewards. This connection holds for any adaptive algorithm, and is leveraged (i) to improve error probability upper bounds of some existing algorithms, such as the celebrated SR (Successive Rejects) algorithm \citep{audibert2010best}, and (ii) to devise and analyze new algorithms. In particular, we present CR (Continuous Rejects), a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms. Applying our Large Deviation results, we prove that CR enjoys better performance guarantees than existing algorithms, including SR. Extensive numerical experiments confirm this observation.
Submitted 19 February, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
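For readers unfamiliar with the Successive Rejects baseline referred to in the abstract above, a minimal Python sketch follows. It uses the phase lengths as commonly stated for that algorithm; it is not the paper's new algorithm, whose rejection rule is not given in the abstract. The `pull` callback, the function name, and the assumption that `budget` is large enough for every arm to be pulled at least once in the first phase are illustrative choices.

```python
import math

def successive_rejects(pull, num_arms, budget):
    """Successive Rejects sketch. `pull(arm)` returns one reward sample.
    Assumes budget is large enough that the first phase pulls each arm
    at least once. Returns the index of the last surviving arm."""
    # log_bar(K) = 1/2 + sum_{i=2}^{K} 1/i
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    sums = [0.0] * num_arms
    pulls = [0] * num_arms
    prev_n = 0
    for phase in range(1, num_arms):
        # Cumulative number of pulls per surviving arm at the end of this phase.
        n_k = math.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - phase)))
        for arm in active:
            for _ in range(n_k - prev_n):
                sums[arm] += pull(arm)
                pulls[arm] += 1
        prev_n = n_k
        # Reject the surviving arm with the lowest empirical mean.
        worst = min(active, key=lambda a: sums[a] / pulls[a])
        active.remove(worst)
    return active[0]
```

The key contrast with the algorithm described in the abstract is that this baseline can only discard an arm at the end of a pre-scheduled phase, whereas the abstract describes rejecting arms in any round based on observed empirical gaps.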
-
Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement Learning
Authors:
Stefan Stojanovic,
Yassir Jedra,
Alexandre Proutiere
Abstract:
We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure. In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP. In both cases, each entry of the matrix carries important information, and we seek estimation methods with low entry-wise error. Importantly, these methods further need to accommodate inherent correlations in the available data (e.g., for MDPs, the data consists of system trajectories). We investigate the performance of simple spectral-based matrix estimation approaches: we show that they efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error. These new results on low-rank matrix estimation make it possible to devise reinforcement learning algorithms that fully exploit the underlying low-rank structure. We provide two examples of such algorithms: a regret minimization algorithm for low-rank bandit problems, and a best policy identification algorithm for reward-free RL in low-rank MDPs. Both algorithms yield state-of-the-art performance guarantees.
Submitted 27 October, 2023; v1 submitted 10 October, 2023;
originally announced October 2023.
-
Sub-linear Regret in Adaptive Model Predictive Control
Authors:
Damianos Tranos,
Alexandre Proutiere
Abstract:
We consider the problem of adaptive Model Predictive Control (MPC) for uncertain linear systems with additive disturbances and with state and input constraints. We present STT-MPC (Self-Tuning Tube-based Model Predictive Control), an online algorithm that combines the certainty-equivalence principle and polytopic tubes. Specifically, at any given step, STT-MPC infers the system dynamics using the Least Squares Estimator (LSE), and applies a controller obtained by solving an MPC problem using these estimates. The polytopic tubes ensure that, despite the uncertainties, state and input constraints are satisfied, and that recursive feasibility and asymptotic stability hold. In this work, we analyze the regret of the algorithm, when compared to an oracle algorithm initially aware of the system dynamics. We establish that the expected regret of STT-MPC does not exceed $O(T^{1/2 + ε})$, where $ε\in (0,1)$ is a design parameter tuning the persistent excitation component of the algorithm. Our result relies on a recently proposed exponential decay of sensitivity property and, to the best of our knowledge, is the first of its kind in this setting. We illustrate the performance of our algorithm using a simple numerical example.
Submitted 7 October, 2023;
originally announced October 2023.
-
On Universally Optimal Algorithms for A/B Testing
Authors:
Po-An Wang,
Kaito Ariu,
Alexandre Proutiere
Abstract:
We study the problem of best-arm identification with fixed budget in stochastic multi-armed bandits with Bernoulli rewards. For the problem with two arms, also known as the A/B testing problem, we prove that there is no algorithm that (i) performs as well as the algorithm sampling each arm equally (referred to as the uniform sampling algorithm) in all instances, and that (ii) strictly outperforms uniform sampling on at least one instance. In short, there is no algorithm better than the uniform sampling algorithm. To establish this result, we first introduce the natural class of consistent and stable algorithms, and show that any algorithm that performs as well as the uniform sampling algorithm in all instances belongs to this class. The proof then proceeds by deriving a lower bound on the error rate satisfied by any consistent and stable algorithm, and by showing that the uniform sampling algorithm matches this lower bound. Our results provide a solution to the two open problems presented in \citep{qin2022open}. For the general problem with more than two arms, we provide a first set of results. We characterize the asymptotic error rate of the celebrated Successive Rejects (SR) algorithm \citep{audibert2010best} and show that, surprisingly, the uniform sampling algorithm outperforms the SR algorithm in some instances.
Submitted 4 June, 2024; v1 submitted 23 August, 2023;
originally announced August 2023.
-
Revisiting Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model
Authors:
Kaito Ariu,
Alexandre Proutiere,
Se-Young Yun
Abstract:
In this paper, we investigate the problem of recovering hidden communities in the Labeled Stochastic Block Model (LSBM) with a finite number of clusters whose sizes grow linearly with the total number of nodes. We derive the necessary and sufficient conditions under which the expected number of misclassified nodes is less than $ s $, for any number $ s = o(n) $. To achieve this, we propose IAC (Instance-Adaptive Clustering), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability. IAC is a novel two-phase algorithm that consists of a one-shot spectral clustering step followed by iterative likelihood-based cluster assignment improvements. This approach is based on the instance-specific lower bound and notably does not require any knowledge of the model parameters, including the number of clusters. By performing the spectral clustering only once, IAC maintains an overall computational complexity of $ \mathcal{O}(n\, \text{polylog}(n)) $, making it scalable and practical for large-scale problems.
Submitted 2 February, 2025; v1 submitted 18 June, 2023;
originally announced June 2023.
-
Conformal Off-Policy Evaluation in Markov Decision Processes
Authors:
Daniele Foffano,
Alessio Russo,
Alexandre Proutiere
Abstract:
Reinforcement Learning aims at identifying and evaluating efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data in an online manner (this is the case when experimenting is expensive, risky or unethical). For such applications, the reward of a given policy (the target policy) must be estimated using historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift due to the discrepancies between the target and the behavior policies. We propose and empirically evaluate different ways to deal with this shift. Some of these methods yield conformalized intervals with reduced length compared to existing approaches, while maintaining the same certainty level.
Submitted 19 September, 2023; v1 submitted 5 April, 2023;
originally announced April 2023.
-
On the Sample Complexity of Representation Learning in Multi-task Bandits with Global and Local structure
Authors:
Alessio Russo,
Alexandre Proutiere
Abstract:
We investigate the sample complexity of learning the optimal arm for multi-task bandit problems. Arms consist of two components: one that is shared across tasks (that we call representation) and one that is task-specific (that we call predictor). The objective is to learn the optimal (representation, predictor)-pair for each task, under the assumption that the optimal representation is common to all tasks. Within this framework, efficient learning algorithms should transfer knowledge across tasks. We consider the best-arm identification problem for a fixed confidence, where, in each round, the learner actively selects both a task and an arm, and observes the corresponding reward. We derive instance-specific sample complexity lower bounds satisfied by any $(δ_G,δ_H)$-PAC algorithm (such an algorithm identifies the best representation with probability at least $1-δ_G$, and the best predictor for a task with probability at least $1-δ_H$). We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound, and scales at most as $H(G\log(1/δ_G)+ X\log(1/δ_H))$, with $X,G,H$ being, respectively, the number of tasks, representations and predictors. By comparison, this scaling is significantly better than that of the classical best-arm identification algorithm, which scales as $HGX\log(1/δ)$.
Submitted 28 November, 2022;
originally announced November 2022.
-
Nearly Optimal Latent State Decoding in Block MDPs
Authors:
Yassir Jedra,
Junghyun Lee,
Alexandre Proutière,
Se-Young Yun
Abstract:
We investigate the problems of model estimation and reward-free learning in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are first interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP. We then study the problem of learning near-optimal policies in the reward-free framework. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible rate. Interestingly, our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of possible contexts.
Submitted 24 February, 2023; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Best Policy Identification in Linear MDPs
Authors:
Jerome Taupin,
Yassir Jedra,
Alexandre Proutiere
Abstract:
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-δ$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by ${\cal O}({\frac{d}{(\varepsilon+Δ)^2}} (\log(\frac{1}δ)+d))$ where $Δ$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $δ$), and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.
Submitted 11 August, 2022;
originally announced August 2022.
-
Measurement-based Admission Control in Sliced Networks: A Best Arm Identification Approach
Authors:
Simon Lindståhl,
Alexandre Proutiere,
Andreas Johnsson
Abstract:
In sliced networks, the shared tenancy of slices requires adaptive admission control of data flows, based on measurements of network resources. In this paper, we investigate the design of measurement-based admission control schemes, deciding whether a new data flow can be admitted and, if so, on which slice. The objective is to devise a joint measurement and decision strategy that returns a correct decision (e.g., the least loaded slice) with a certain level of confidence while minimizing the measurement cost (the number of measurements made before committing to the decision). We study the design of such strategies for several natural admission criteria specifying what a correct decision is. For each of these criteria, using tools from best arm identification in bandits, we first derive an explicit information-theoretical lower bound on the cost of any algorithm returning the correct decision with fixed confidence. We then devise a joint measurement and decision strategy achieving this theoretical limit. We empirically compare the measurement costs of these strategies against both the lower bounds and a naive measurement scheme. We find that our algorithm significantly outperforms the naive scheme (by a factor of 2 to 8).
Submitted 9 August, 2022; v1 submitted 14 April, 2022;
originally announced April 2022.
-
Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandit Approach
Authors:
Filippo Vannella,
Alexandre Proutiere,
Yassir Jedra,
Jaeseong Jeong
Abstract:
Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity. In this paper, we devise algorithms learning optimal tilt control policies from existing data (in the so-called passive learning setting) or from data actively generated by the algorithms (the active learning setting). We formalize the design of such algorithms as a Best Policy Identification (BPI) problem in Contextual Linear Multi-Arm Bandits (CL-MAB). An arm represents an antenna tilt update; the context captures current network conditions; the reward corresponds to an improvement of performance, mixing coverage and capacity; and the objective is to identify, with a given level of confidence, an approximately optimal policy (a function mapping the context to an arm with maximal reward). For CL-MAB in both active and passive learning settings, we derive information-theoretical lower bounds on the number of samples required by any algorithm returning an approximately optimal policy with a given level of certainty, and devise algorithms achieving these fundamental limits. We apply our algorithms to the Remote Electrical Tilt (RET) optimization problem in cellular networks, and show that they can produce an optimal tilt update policy using far fewer data samples than naive or existing rule-based learning algorithms.
Submitted 6 January, 2022;
originally announced January 2022.
-
Minimal Expected Regret in Linear Quadratic Control
Authors:
Yassir Jedra,
Alexandre Proutiere
Abstract:
We consider the problem of online learning in Linear Quadratic Control systems whose state transition and state-action transition matrices $A$ and $B$ may be initially unknown. We devise an online learning algorithm and provide guarantees on its expected regret. This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_xT})$ when $A$ and $B$ are unknown, (ii) by $\widetilde{O}(d_x^2\log(T))$ if only $A$ is unknown, and (iii) by $\widetilde{O}(d_x(d_u+d_x)\log(T))$ if only $B$ is unknown and under some mild non-degeneracy condition ($d_x$ and $d_u$ denote the dimensions of the state and of the control input, respectively). These regret scalings are minimal in $T$, $d_x$ and $d_u$ as they match existing lower bounds in scenario (i) when $d_x\le d_u$ [SF20], and in scenario (ii) [lai1986]. We conjecture that our upper bounds are also optimal in scenario (iii) (there is no known lower bound in this setting).
Existing online algorithms proceed in epochs of (typically exponentially) growing durations. The control policy is fixed within each epoch, which considerably simplifies the analysis of the estimation error on $A$ and $B$ and hence of the regret. Our algorithm departs from this design choice: it is a simple variant of certainty-equivalence regulators, where the estimates of $A$ and $B$ and the resulting control policy can be updated as frequently as we wish, possibly at every step. Quantifying the impact of such a constantly-varying control policy on the performance of these estimates and on the regret constitutes one of the technical challenges tackled in this paper.
Submitted 29 September, 2021;
originally announced September 2021.
-
Balancing detectability and performance of attacks on the control channel of Markov Decision Processes
Authors:
Alessio Russo,
Alexandre Proutiere
Abstract:
We investigate the problem of designing optimal stealthy poisoning attacks on the control channel of Markov decision processes (MDPs). This research is motivated by the recent interest of the research community in adversarial and poisoning attacks applied to MDPs and reinforcement learning (RL) methods. The policies resulting from these methods have been shown to be vulnerable to attacks perturbing the observations of the decision-maker. In such an attack, drawing inspiration from adversarial examples used in supervised learning, the amplitude of the adversarial perturbation is limited according to some norm, with the hope that this constraint will make the attack imperceptible. However, such constraints do not grant any level of undetectability and do not take into account the dynamic nature of the underlying Markov process. In this paper, we propose a new attack formulation, based on information-theoretical quantities, that considers the objective of minimizing the detectability of the attack as well as the performance of the controlled process. We analyze the trade-off between the efficiency of the attack and its detectability. We conclude with examples and numerical simulations illustrating this trade-off.
Submitted 15 September, 2021;
originally announced September 2021.
-
Online Learning of Optimally Diverse Rankings
Authors:
Stefan Magureanu,
Alexandre Proutiere,
Marcus Isaksson,
Boxun Zhang
Abstract:
Search engines answer users' queries by listing relevant items (e.g. documents, songs, products, web pages, ...). These engines rely on algorithms that learn to rank items so as to present an ordered list maximizing the probability that it contains a relevant item. The main challenge in the design of learning-to-rank algorithms stems from the fact that queries often have different meanings for different users. In the absence of any contextual information about the query, one often has to adhere to the {\it diversity} principle, i.e., to return a list covering the various possible topics or meanings of the query. To formalize this learning-to-rank problem, we propose a natural model where (i) items are categorized into topics, (ii) users find items relevant only if they match the topic of their query, and (iii) the engine is not aware of the topic of an arriving query, nor of the frequency at which queries related to various topics arrive, nor of the topic-dependent click-through-rates of the items. For this problem, we devise LDR (Learning Diverse Rankings), an algorithm that efficiently learns the optimal list based on users' feedback only. We show that after $T$ queries, the regret of LDR scales as $O((N-L)\log(T))$ where $N$ is the number of all items. We further establish that this scaling cannot be improved, i.e., LDR is order optimal. Finally, using numerical experiments on both artificial and real-world data, we illustrate the superiority of LDR compared to existing learning-to-rank algorithms.
Submitted 13 September, 2021;
originally announced September 2021.
-
Regret Analysis in Deterministic Reinforcement Learning
Authors:
Damianos Tranos,
Alexandre Proutiere
Abstract:
We consider Markov Decision Processes (MDPs) with deterministic transitions and study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms. We present logarithmic problem-specific regret lower bounds that explicitly depend on the system parameter (in contrast to previous minimax approaches) and thus, truly quantify the fundamental limit of performance achievable by any learning algorithm. Deterministic MDPs can be interpreted as graphs and analyzed in terms of their cycles, a fact which we leverage in order to identify a class of deterministic MDPs whose regret lower bound can be determined numerically. We further exemplify this result on a deterministic line search problem, and a deterministic MDP with state-dependent rewards, whose regret lower bounds we can state explicitly. These bounds share similarities with the known problem-specific bound of the multi-armed bandit problem and suggest that navigation on a deterministic MDP need not have an effect on the performance of a learning algorithm.
Submitted 27 June, 2021;
originally announced June 2021.
-
Navigating to the Best Policy in Markov Decision Processes
Authors:
Aymen Al Marjani,
Aurélien Garivier,
Alexandre Proutiere
Abstract:
We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least $1-δ$. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results relative to the \emph{generative setting}~\cite{pmlr-v139-marjani21a}, where the agent could at each step query the random outcome of any (state, action) pair. In contrast, we show here how to deal with the \emph{navigation constraints}, induced by the \emph{online setting}. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains which we consider of wide interest in the analysis of Markov Decision Processes.
Submitted 25 October, 2021; v1 submitted 5 June, 2021;
originally announced June 2021.
-
Regret in Online Recommendation Systems
Authors:
Kaito Ariu,
Narae Ryu,
Se-Young Yun,
Alexandre Proutière
Abstract:
This paper proposes a theoretical analysis of recommendation systems in an online setting, where items are sequentially recommended to users over time. In each round, a user, randomly picked from a population of $m$ users, requests a recommendation. The decision-maker observes the user and selects an item from a catalogue of $n$ items. Importantly, an item cannot be recommended twice to the same user. The probabilities that a user likes each item are unknown. The performance of the recommendation algorithm is captured through its regret, considering as a reference an Oracle algorithm aware of these probabilities. We investigate various structural assumptions on these probabilities: we derive for each structure regret lower bounds, and devise algorithms achieving these limits. Interestingly, our analysis reveals the relative weights of the different components of regret: the component due to the constraint of not presenting the same item twice to the same user, that due to learning the chances users like items, and finally that arising when learning the underlying structure.
Submitted 23 October, 2020;
originally announced October 2020.
-
Thresholded Lasso Bandit
Authors:
Kaito Ariu,
Kenshi Abe,
Alexandre Proutière
Abstract:
In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension $d$, but where the reward function depends on a few, say $s_0\ll d$, of these features only. We present Thresholded Lasso bandit, an algorithm that (i) estimates the vector defining the reward function as well as its sparse support, i.e., significant feature elements, using the Lasso framework with thresholding, and (ii) selects an arm greedily according to this estimate projected on its support. The algorithm does not require prior knowledge of the sparsity index $s_0$ and can be parameter-free under some symmetric assumptions. For this simple algorithm, we establish non-asymptotic regret upper bounds scaling as $\mathcal{O}( \log d + \sqrt{T} )$ in general, and as $\mathcal{O}( \log d + \log T)$ under the so-called margin condition (a probabilistic condition on the separation of the arm rewards). The regret of previous algorithms scales as $\mathcal{O}( \log d + \sqrt{T \log (d T)})$ and $\mathcal{O}( \log T \log d)$ in the two settings, respectively. Through numerical experiments, we confirm that our algorithm outperforms existing methods.
Submitted 19 June, 2022; v1 submitted 22 October, 2020;
originally announced October 2020.
-
Adaptive Sampling for Best Policy Identification in Markov Decision Processes
Authors:
Aymen Al Marjani,
Alexandre Proutiere
Abstract:
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimality gaps and the variance of the next-state value function, and thus really captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS against state-of-the-art algorithms are discussed and illustrated numerically.
Submitted 10 May, 2021; v1 submitted 28 September, 2020;
originally announced September 2020.
-
Optimal Best-arm Identification in Linear Bandits
Authors:
Yassir Jedra,
Alexandre Proutiere
Abstract:
We study the problem of best-arm identification with fixed confidence in stochastic linear bandits. The objective is to identify the best arm with a given level of certainty while minimizing the sampling budget. We devise a simple algorithm whose sampling complexity matches known instance-specific lower bounds, asymptotically almost surely and in expectation. The algorithm relies on an arm sampling rule that tracks an optimal proportion of arm draws, and that remarkably can be updated as rarely as we wish, without compromising its theoretical guarantees. Moreover, unlike existing best-arm identification strategies, our algorithm uses a stopping rule that does not depend on the number of arms. Experimental results suggest that our algorithm significantly outperforms existing algorithms. The paper further provides a first analysis of the best-arm identification problem in linear bandits with a continuous set of arms.
Submitted 29 June, 2020;
originally announced June 2020.
-
Off-policy Learning for Remote Electrical Tilt Optimization
Authors:
Filippo Vannella,
Jaeseong Jeong,
Alexandre Proutiere
Abstract:
We address the problem of Remote Electrical Tilt (RET) optimization using off-policy Contextual Multi-Armed-Bandit (CMAB) techniques. The goal in RET optimization is to control the orientation of the vertical tilt angle of the antenna to optimize Key Performance Indicators (KPIs) representing the Quality of Service (QoS) perceived by the users in cellular networks. Learning an improved tilt update policy is hard. On the one hand, coming up with a new policy in an online manner in a real network requires exploring tilt updates that have never been used before, and is operationally too risky. On the other hand, devising this policy via simulations suffers from the simulation-to-reality gap. In this paper, we circumvent these issues by learning an improved policy in an offline manner using existing data collected on real networks. We formulate the problem of devising such a policy using the off-policy CMAB framework. We propose CMAB learning algorithms to extract optimal tilt update policies from the data. We train and evaluate these policies on real-world 4G Long Term Evolution (LTE) cellular network data. Our policies show consistent improvements over the rule-based logging policy used to collect the data.
Submitted 21 May, 2020;
originally announced May 2020.
-
Predictive Bandits
Authors:
Simon Lindståhl,
Alexandre Proutiere,
Andreas Johnsson
Abstract:
We introduce and study a new class of stochastic bandit problems, referred to as predictive bandits. In each round, the decision maker first decides whether to gather information about the rewards of particular arms (so that their rewards in this round can be predicted). These measurements are costly, and may be corrupted by noise. The decision maker then selects an arm to be actually played in the round. Predictive bandits find applications in many areas; e.g. they can be applied to channel selection problems in radio communication systems. In this paper, we provide the first theoretical results about predictive bandits, and focus on scenarios where the decision maker is allowed to measure at most one arm per round. We derive asymptotic instance-specific regret lower bounds for these problems, and develop algorithms whose regret matches these fundamental limits. We illustrate the performance of our algorithms through numerical experiments. In particular, we highlight the gains that can be achieved by using reward predictions, and investigate the impact of the noise in the corresponding measurements.
Submitted 2 April, 2020;
originally announced April 2020.
-
Finite-time Identification of Stable Linear Systems: Optimality of the Least-Squares Estimator
Authors:
Yassir Jedra,
Alexandre Proutiere
Abstract:
We present a new finite-time analysis of the estimation error of the Ordinary Least Squares (OLS) estimator for stable linear time-invariant systems. We characterize the number of observed samples (the length of the observed trajectory) sufficient for the OLS estimator to be $(\varepsilon,δ)$-PAC, i.e., to yield an estimation error less than $\varepsilon$ with probability at least $1-δ$. We show that this number matches existing sample complexity lower bounds [1,2] up to universal multiplicative factors (independent of $(\varepsilon,δ)$ and of the system). This paper hence establishes the optimality of the OLS estimator for stable systems, a result conjectured in [1]. Our analysis of the performance of the OLS estimator is simpler, sharper, and easier to interpret than existing analyses. It relies on new concentration results for the covariates matrix.
Submitted 26 March, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Distributed Online Optimization with Long-Term Constraints
Authors:
Deming Yuan,
Alexandre Proutiere,
Guodong Shi
Abstract:
We consider distributed online convex optimization problems, where the distributed system consists of various computing units connected through a time-varying communication graph. In each time step, each computing unit selects a constrained vector, experiences a loss equal to an arbitrary convex function evaluated at this vector, and may communicate to its neighbors in the graph. The objective is to minimize the system-wide loss accumulated over time. We propose a decentralized algorithm with regret and cumulative constraint violation in $\mathcal{O}(T^{\max\{c,1-c\} })$ and $\mathcal{O}(T^{1-c/2})$, respectively, for any $c\in (0,1)$, where $T$ is the time horizon. When the loss functions are strongly convex, we establish improved regret and constraint violation upper bounds in $\mathcal{O}(\log(T))$ and $\mathcal{O}(\sqrt{T\log(T)})$. These regret scalings match those obtained by state-of-the-art algorithms and fundamental limits in the corresponding centralized online optimization problem (for both convex and strongly convex loss functions). In the case of bandit feedback, the proposed algorithms achieve a regret and constraint violation in $\mathcal{O}(T^{\max\{c,1-c/3 \} })$ and $\mathcal{O}(T^{1-c/2})$ for any $c\in (0,1)$. We numerically illustrate the performance of our algorithms for the particular case of distributed online regularized linear regression problems.
Submitted 20 December, 2019;
originally announced December 2019.
-
Optimal Clustering from Noisy Binary Feedback
Authors:
Kaito Ariu,
Jungseul Ok,
Alexandre Proutiere,
Se-Young Yun
Abstract:
We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users' clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster and the question and by an item-specific parameter characterizing the {\it hardness} of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm and whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items hard to cluster and relevant questions more often. We compare the performance of our algorithms with or without the adaptive selection strategy numerically and illustrate the gain achieved by being adaptive.
Submitted 5 February, 2024; v1 submitted 14 October, 2019;
originally announced October 2019.
-
An Optimal Algorithm for Multiplayer Multi-Armed Bandits
Authors:
Alexandre Proutiere,
Po-An Wang
Abstract:
The paper addresses the Multiplayer Multi-Armed Bandit (MMAB) problem, where $M$ decision makers or players collaborate to maximize their cumulative reward. When several players select the same arm, a collision occurs and no reward is collected on this arm. Players involved in a collision are informed about this collision. We present DPE (Decentralized Parsimonious Exploration), a decentralized algorithm that achieves the same regret as that obtained by an optimal centralized algorithm. Our algorithm has better regret guarantees than the state-of-the-art algorithm SIC-MMAB \cite{boursier2019}. As in SIC-MMAB, players communicate through collisions only. An additional important advantage of DPE is that it requires very little communication. Specifically, the expected number of rounds where players use collisions to communicate is finite.
Submitted 26 October, 2019; v1 submitted 28 September, 2019;
originally announced September 2019.
-
Optimal Attacks on Reinforcement Learning Policies
Authors:
Alessio Russo,
Alexandre Proutiere
Abstract:
Control policies trained using Deep Reinforcement Learning have recently been shown to be vulnerable to adversarial attacks introducing even very small perturbations to the policy input. The attacks proposed so far have been designed using heuristics, and build on existing adversarial example crafting techniques used to dupe classifiers in supervised learning. In contrast, this paper investigates the problem of devising optimal attacks, depending on a well-defined attacker's objective, e.g., to minimize the main agent's average reward. When the policy and the system dynamics, as well as rewards, are known to the attacker, a scenario referred to as a white-box attack, designing optimal attacks amounts to solving a Markov Decision Process. For what we call black-box attacks, where neither the policy nor the system is known, optimal attacks can be trained using Reinforcement Learning techniques. Through numerical experiments, we demonstrate the efficiency of our attacks compared to existing attacks (usually based on Gradient methods). We further quantify the potential impact of attacks and establish its connection to the smoothness of the policy under attack. Smooth policies are naturally less prone to attacks (this explains why Lipschitz policies, with respect to the state, are more resilient). Finally, we show that from the main agent's perspective, the system uncertainties and the attacker can be modeled as a Partially Observable Markov Decision Process. We actually demonstrate that using Reinforcement Learning techniques tailored to POMDP (e.g. using Recurrent Neural Networks) leads to more resilient policies.
Submitted 31 July, 2019;
originally announced July 2019.
-
From self-tuning regulators to reinforcement learning and back again
Authors:
Nikolai Matni,
Alexandre Proutiere,
Anders Rantzer,
Stephen Tu
Abstract:
Machine and reinforcement learning (RL) are increasingly being applied to plan and control the behavior of autonomous systems interacting with the physical world. Examples include self-driving vehicles, distributed sensor networks, and agile robots. However, when machine learning is to be applied in these new settings, the algorithms had better come with the same type of reliability, robustness, and safety bounds that are hallmarks of control theory, or failures could be catastrophic. Thus, as learning algorithms are increasingly and more aggressively deployed in safety critical settings, it is imperative that control theorists join the conversation. The goal of this tutorial paper is to provide a starting point for control theorists wishing to work on learning related problems, by covering recent advances bridging learning and control theory, and by placing these results within an appropriate historical context of system identification and adaptive control.
Submitted 22 September, 2019; v1 submitted 26 June, 2019;
originally announced June 2019.
-
Sample Complexity Lower Bounds for Linear System Identification
Authors:
Yassir Jedra,
Alexandre Proutiere
Abstract:
This paper establishes problem-specific sample complexity lower bounds for linear system identification problems. The sample complexity is defined in the PAC framework: it corresponds to the time it takes to identify the system parameters with prescribed accuracy and confidence levels. By problem-specific, we mean that the lower bound explicitly depends on the system to be identified (which contrasts with minimax lower bounds), and hence really captures the identification hardness specific to the system. We consider both uncontrolled and controlled systems. For uncontrolled systems, the lower bounds are valid for any linear system, stable or not, and depend only on the system's finite-time controllability Gramian. A simplified lower bound depending on the spectrum of the system only is also derived. In view of recent finite-time analysis of classical estimation methods (e.g. ordinary least squares), our sample complexity lower bounds are tight for many systems. For controlled systems, our lower bounds are not as explicit as in the case of uncontrolled systems, but could well provide interesting insights into the design of control policy with minimal sample complexity.
Submitted 25 March, 2019;
originally announced March 2019.
-
Distributed Online Linear Regression
Authors:
Deming Yuan,
Alexandre Proutiere,
Guodong Shi
Abstract:
We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the \emph{network-wide} data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a given node are assessed through the notion of regret, defined as the difference between their cumulative network-wide square errors and those of the best off-line network-wide linear predictor. Various scenarios are investigated, depending on the nature of the local feedback (full information or bandit feedback), on the set of available predictors (the decision set), and the way data is generated (by an oblivious or adaptive adversary). We propose simple and natural distributed regression algorithms, involving, at each node and in each round, a local gradient descent step and a communication and averaging step where nodes aim at aligning their predictors to those of their neighbors. We establish regret upper bounds typically in ${\cal O}(T^{3/4})$ when the decision set is unbounded and in ${\cal O}(\sqrt{T})$ in case of bounded decision set.
Submitted 13 February, 2019;
originally announced February 2019.
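The scheme described in the abstract above (a local gradient descent step followed by a communication and averaging step at each node) can be illustrated with a minimal sketch. This is not the authors' algorithm: the fixed step size, the doubly stochastic mixing matrix `mixing`, the full-information squared-loss feedback, the noiseless synthetic data, and the function name `distributed_online_regression` are all assumptions made for illustration.

```python
import random

def distributed_online_regression(streams, weights, dim, step=0.05):
    """Local gradient step, then neighbor averaging, at every node and round.

    streams[t][i] = (x, y) is the sample seen by node i in round t
    (hypothetical layout). weights[i][j] is an assumed doubly stochastic
    mixing weight, equal to zero when j is not a neighbor of i.
    """
    n = len(weights)
    w = [[0.0] * dim for _ in range(n)]
    for round_samples in streams:
        # Local gradient step on the squared loss (<w_i, x> - y)^2.
        stepped = []
        for i, (x, y) in enumerate(round_samples):
            err = sum(wi * xi for wi, xi in zip(w[i], x)) - y
            stepped.append([wi - step * 2.0 * err * xi
                            for wi, xi in zip(w[i], x)])
        # Communication and averaging step: mix with neighbors' predictors.
        w = [[sum(weights[i][j] * stepped[j][k] for j in range(n))
              for k in range(dim)]
             for i in range(n)]
    return w

# Synthetic demo (assumed setup): 3 fully connected nodes, noiseless labels.
rng = random.Random(0)
true_w = [1.0, -2.0]
n_nodes = 3
mixing = [[1.0 / n_nodes] * n_nodes for _ in range(n_nodes)]
streams = []
for _ in range(2000):
    round_samples = []
    for _ in range(n_nodes):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = sum(a * b for a, b in zip(true_w, x))
        round_samples.append((x, y))
    streams.append(round_samples)
predictors = distributed_online_regression(streams, mixing, dim=2)
```

In this toy setting every node's predictor ends up close to `true_w`. The regret guarantees quoted in the abstract concern adversarially generated data and are not demonstrated by this sketch.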
-
Learning to Personalize in Appearance-Based Gaze Tracking
Authors:
Erik Lindén,
Jonas Sjöstrand,
Alexandre Proutiere
Abstract:
Personal variations severely limit the performance of appearance-based gaze tracking. Adapting to these variations using standard neural network model adaptation methods is difficult. The problems range from overfitting, due to small amounts of training data, to underfitting, due to restrictive model architectures. We tackle these problems by introducing the SPatial Adaptive GaZe Estimator (SPAZE). By modeling personal variations as a low-dimensional latent parameter space, SPAZE provides just enough adaptability to capture the range of personal variations without being prone to overfitting. Calibrating SPAZE for a new person reduces to solving a small optimization problem. SPAZE achieves an error of 2.70 degrees with 9 calibration samples on MPIIGaze, improving on the state-of-the-art by 14 %. We contribute to gaze tracking research by empirically showing that personal variations are well-modeled as a 3-dimensional latent parameter space for each eye. We show that this low-dimensionality is expected by examining model-based approaches to gaze tracking. We also show that accurate head pose-free gaze tracking is possible.
Submitted 2 September, 2019; v1 submitted 2 July, 2018;
originally announced July 2018.
-
Exploration in Structured Reinforcement Learning
Authors:
Jungseul Ok,
Alexandre Proutiere,
Damianos Tranos
Abstract:
We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that could be potentially exploited to minimize the exploration rates of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward functions are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes $S$ and $A$ of the state and action spaces, i.e., they are smaller than $c\log T$ where $T$ is the time horizon and the constant $c$ only depends on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs where the regret lower bound typically scales as $SA\log T$. We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version is still able to efficiently exploit the structure.
Submitted 29 November, 2018; v1 submitted 3 June, 2018;
originally announced June 2018.
-
Clustering in Block Markov Chains
Authors:
Jaron Sanders,
Alexandre Proutière,
Se-Young Yun
Abstract:
This paper considers cluster detection in Block Markov Chains (BMCs). These Markov chains are characterized by a block structure in their transition matrix. More precisely, the $n$ possible states are divided into a finite number of $K$ groups or clusters, such that states in the same cluster exhibit the same transition rates to other states. One observes a trajectory of the Markov chain, and the objective is to recover, from this observation only, the (initially unknown) clusters. In this paper we devise a clustering procedure that accurately, efficiently, and provably detects the clusters. We first derive a fundamental information-theoretical lower bound on the detection error rate satisfied under any clustering algorithm. This bound identifies the parameters of the BMC, and trajectory lengths, for which it is possible to accurately detect the clusters. We next develop two clustering algorithms that can together accurately recover the cluster structure from the shortest possible trajectories, whenever the parameters allow detection. These algorithms thus reach the fundamental detectability limit, and are optimal in that sense.
Submitted 29 July, 2019; v1 submitted 26 December, 2017;
originally announced December 2017.
-
Minimal Exploration in Structured Stochastic Bandits
Authors:
Richard Combes,
Stefan Magureanu,
Alexandre Proutiere
Abstract:
This paper introduces and addresses a wide class of stochastic bandit problems where the function mapping the arm to the corresponding reward exhibits some known structural properties. Most existing structures (e.g. linear, Lipschitz, unimodal, combinatorial, dueling, ...) are covered by our framework. We derive an asymptotic instance-specific regret lower bound for these problems, and develop OSSB, an algorithm whose regret matches this fundamental limit. OSSB is not based on the classical principle of "optimism in the face of uncertainty" or on Thompson sampling, and rather aims at matching the minimal exploration rates of sub-optimal arms as characterized in the derivation of the regret lower bound. We illustrate the efficiency of OSSB using numerical experiments in the case of the linear bandit problem and show that OSSB outperforms existing algorithms, including Thompson sampling.
Submitted 1 November, 2017;
originally announced November 2017.
-
Strategic Arrivals to Queues Offering Priority Service
Authors:
Rajat Talak,
D. Manjunath,
Alexandre Proutiere
Abstract:
We consider strategic arrivals to a FCFS service system that starts service at a fixed time and has to serve a fixed number of customers, e.g., an airplane boarding system. Arriving early induces a higher waiting cost (waiting before service begins) while arriving late induces a cost because earlier arrivals take the better seats. We first consider arrivals of heterogeneous customers that choose arrival times to minimize the weighted sum of the waiting cost and the cost due to the expected number of predecessors. We characterize the unique Nash equilibria for this system.
Next, we consider a system offering L levels of priority service with a FCFS queue for each priority level. Higher priorities are charged higher admission prices. Customers make two choices - time of arrival and priority of service. We show that the Nash equilibrium corresponds to the customer types being divided into L intervals and customers belonging to each interval choosing the same priority level. We further analyze the net revenue to the server and consider revenue maximizing strategies - number of priority levels and pricing. Numerical results show that with only three queues the server can attain near maximum revenue.
Submitted 11 August, 2018; v1 submitted 19 April, 2017;
originally announced April 2017.
-
Optimal Cluster Recovery in the Labeled Stochastic Block Model
Authors:
Se-Young Yun,
Alexandre Proutiere
Abstract:
We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number $K$ of clusters of sizes linearly growing with the global population of items $n$. Every pair of items is labeled independently at random, and label $\ell$ appears with probability $p(i,j,\ell)$ between two items in clusters indexed by $i$ and $j$, respectively. The objective is to reconstruct the clusters from the observation of these random labels.
Clustering under the SBM and its extensions has attracted much attention recently. Most existing work aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters. We find the set of parameters such that there exists a clustering algorithm with at most $s$ misclassified items on average under the general LSBM and for any $s=o(n)$, which solves one open problem raised in \cite{abbe2015community}. We further develop an algorithm, based on simple spectral methods, that achieves this fundamental performance limit within $O(n \mbox{polylog}(n))$ computations and without a priori knowledge of the model parameters.
Submitted 21 May, 2016; v1 submitted 20 October, 2015;
originally announced October 2015.
-
Boolean Gossip Networks
Authors:
Bo Li,
Junfeng Wu,
Hongsheng Qi,
Alexandre Proutiere,
Guodong Shi
Abstract:
This paper proposes and investigates a Boolean gossip model as a simplified but non-trivial probabilistic Boolean network. With positive node interactions, in view of standard theories from Markov chains, we prove that the node states asymptotically converge to an agreement at a binary random variable, whose distribution is characterized for large-scale networks by mean-field approximation. Using combinatorial analysis, we also successfully count the number of communication classes of the positive Boolean network explicitly in terms of the topology of the underlying interaction graph, where remarkably minor variation in local structures can drastically change the number of network communication classes. With general Boolean interaction rules, emergence of absorbing network Boolean dynamics is shown to be determined by the network structure with necessary and sufficient conditions established regarding when the Boolean gossip process defines absorbing Markov chains. Particularly, it is shown that for the majority of the Boolean interaction rules, except for nine out of the total $2^{16}-1$ possible nonempty sets of binary Boolean functions, whether the induced chain is absorbing has nothing to do with the topology of the underlying interaction graph, as long as connectivity is assumed. These results illustrate possibilities of relating dynamical properties of Boolean networks to graphical properties of the underlying interactions.
Submitted 21 May, 2017; v1 submitted 13 July, 2015;
originally announced July 2015.
-
Cluster-Aided Mobility Predictions
Authors:
Jaeseong Jeong,
Mathieu Leconte,
Alexandre Proutiere
Abstract:
Predicting the future location of users in wireless networks has numerous applications, and can help service providers to improve the quality of service perceived by their clients. The location predictors proposed so far estimate the next location of a specific user by inspecting the past individual trajectories of this user. As a consequence, when the training data collected for a given user is limited, the resulting prediction is inaccurate. In this paper, we develop cluster-aided predictors that exploit past trajectories collected from all users to predict the next location of a given user. These predictors rely on clustering techniques and extract from the training data similarities among the mobility patterns of the various users to improve the prediction accuracy. Specifically, we present CAMP (Cluster-Aided Mobility Predictor), a cluster-aided predictor whose design is based on recent non-parametric Bayesian statistical tools. CAMP is robust and adaptive in the sense that it exploits similarities in users' mobility only if such similarities are really present in the training data. We analytically prove the consistency of the predictions provided by CAMP, and investigate its performance using two large-scale datasets. CAMP significantly outperforms existing predictors, and in particular those that only exploit individual past trajectories.
Submitted 21 January, 2016; v1 submitted 12 July, 2015;
originally announced July 2015.
-
Combinatorial Bandits Revisited
Authors:
Richard Combes,
M. Sadegh Talebi,
Alexandre Proutiere,
Marc Lelarge
Abstract:
This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the adversarial setting under bandit feedback, we propose \textsc{CombEXP}, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems.
Submitted 5 November, 2015; v1 submitted 11 February, 2015;
originally announced February 2015.
-
Accurate Community Detection in the Stochastic Block Model via Spectral Algorithms
Authors:
Se-Young Yun,
Alexandre Proutiere
Abstract:
We consider the problem of community detection in the Stochastic Block Model with a finite number $K$ of communities of sizes linearly growing with the network size $n$. This model consists of a random graph such that each pair of vertices is connected independently with probability $p$ within communities and $q$ across communities. One observes a realization of this random graph, and the objective is to reconstruct the communities from this observation. We show that under spectral algorithms, the number of misclassified vertices does not exceed $s$ with high probability as $n$ grows large, whenever $pn=ω(1)$, $s=o(n)$ and \begin{equation*} \liminf_{n\to\infty} {n(α_1 p+α_2 q-(α_1 + α_2)p^{\frac{α_1}{α_1 + α_2}}q^{\frac{α_2}{α_1 + α_2}})\over \log (\frac{n}{s})} >1,\quad\quad(1) \end{equation*} where $α_1$ and $α_2$ denote the (fixed) proportions of vertices in the two smallest communities. In view of recent work by Abbe et al. and Mossel et al., this establishes that the proposed spectral algorithms are able to exactly recover communities whenever this is at all possible in the case of networks with two communities with equal sizes. We conjecture that condition (1) is actually necessary to obtain less than $s$ misclassified vertices asymptotically, which would establish the optimality of spectral methods in more general scenarios.
Submitted 23 December, 2014;
originally announced December 2014.
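To make condition (1) in the abstract above concrete, the following sketch evaluates the ratio inside the limit for a finite $n$. The function name `recovery_ratio` and the example parameters are hypothetical. The condition itself is asymptotic, so a single finite-$n$ value is only indicative.

```python
import math

def recovery_ratio(n, p, q, alpha1, alpha2, s):
    """Finite-n value of the ratio appearing inside the liminf of condition (1)."""
    total = alpha1 + alpha2
    # Weighted geometric mean of p and q; by the weighted AM-GM inequality
    # the numerator below is nonnegative and vanishes when p == q.
    geometric = p ** (alpha1 / total) * q ** (alpha2 / total)
    numerator = n * (alpha1 * p + alpha2 * q - total * geometric)
    return numerator / math.log(n / s)

# Two equal-size communities in the logarithmic-degree regime (assumed example):
# p = a*log(n)/n and q = b*log(n)/n with a = 9, b = 1, and s = 1.
n = 10_000
p = 9 * math.log(n) / n
q = 1 * math.log(n) / n
ratio = recovery_ratio(n, p, q, 0.5, 0.5, s=1)
```

With $α_1 = α_2 = 1/2$, $s = 1$, $p = a\log(n)/n$ and $q = b\log(n)/n$, the ratio simplifies to $(\sqrt{a}-\sqrt{b})^2/2$, so the example evaluates to approximately 2, which exceeds the threshold of 1 in condition (1).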
-
Emergent Behaviors over Signed Random Dynamical Networks: Relative-State-Flipping Model
Authors:
Guodong Shi,
Alexandre Proutiere,
Mikael Johansson,
John S. Baras,
Karl H. Johansson
Abstract:
We study asymptotic dynamical patterns that emerge among a set of nodes interacting in a dynamically evolving signed random network, where positive links carry out standard consensus and negative links induce relative-state flipping. A sequence of deterministic signed graphs defines potential node interactions that take place independently. Each node receives a positive recommendation consistent with the standard consensus algorithm from its positive neighbors, and a negative recommendation defined by relative-state flipping from its negative neighbors. After receiving these recommendations, each node puts a deterministic weight to each recommendation, and then encodes these weighted recommendations in its state update through stochastic attentions defined by two Bernoulli random variables. We establish a number of conditions regarding almost sure convergence and divergence of the node states. We also propose a condition for almost sure state clustering for essentially weakly balanced graphs, with the help of several martingale convergence lemmas. Some fundamental differences in the impact of the deterministic weights and stochastic attentions on the node state evolution are highlighted between the current relative-state-flipping model and the state-flipping model considered in Altafini 2013 and Shi et al. 2014.
Submitted 5 December, 2014;
originally announced December 2014.
-
Streaming, Memory Limited Algorithms for Community Detection
Authors:
Se-Young Yun,
Marc Lelarge,
Alexandre Proutiere
Abstract:
In this paper, we consider sparse networks consisting of a finite number of non-overlapping communities, i.e. disjoint clusters, so that there is higher density within clusters than across clusters. Both the intra- and inter-cluster edge densities vanish when the size of the graph grows large, making the cluster reconstruction problem noisier and hence difficult to solve. We are interested in scenarios where the network size is very large, so that the adjacency matrix of the graph is hard to manipulate and store. The data stream model in which columns of the adjacency matrix are revealed sequentially constitutes a natural framework in this setting. For this model, we develop two novel clustering algorithms that extract the clusters asymptotically accurately. The first algorithm is {\it offline}, as it needs to store and keep the assignments of nodes to clusters, and requires a memory that scales linearly with the network size. The second algorithm is {\it online}, as it may classify a node when the corresponding column is revealed and then discard this information. This algorithm requires a memory growing sub-linearly with the network size. To construct these efficient streaming memory-limited clustering algorithms, we first address the problem of clustering with partial information, where only a small proportion of the columns of the adjacency matrix is observed and develop, for this setting, a new spectral algorithm which is of independent interest.
Submitted 3 November, 2014;
originally announced November 2014.
-
Emergent Behaviors over Signed Random Dynamical Networks: State-Flipping Model
Authors:
Guodong Shi,
Alexandre Proutiere,
Mikael Johansson,
John S. Baras,
Karl H. Johansson
Abstract:
Recent studies from social, biological, and engineering network systems have drawn attention to the dynamics over signed networks, where each link is associated with a positive/negative sign indicating trustful/mistrustful, activator/inhibitor, or secure/malicious interactions. We study asymptotic dynamical patterns that emerge among a set of nodes that interact in a dynamically evolving signed random network. Node interactions take place at random on a sequence of deterministic signed graphs. Each node receives positive or negative recommendations from its neighbors depending on the sign of the interaction arcs, and updates its state accordingly. Recommendations along a positive arc follow the standard consensus update. As in the work by Altafini, negative recommendations use an update where the sign of the neighbor state is flipped. Nodes may weight positive and negative recommendations differently, and random processes are introduced to model the time-varying attention that nodes pay to these recommendations. Conditions for almost sure convergence and divergence of the node states are established. We show that under this so-called state-flipping model, all links contribute to a consensus of the absolute values of the nodes, even under switching sign patterns and dynamically changing environment. A no-survivor property is established, indicating that every node state diverges almost surely if the maximum network state diverges.
Submitted 1 November, 2014;
originally announced November 2014.
-
Unimodal Bandits without Smoothness
Authors:
Richard Combes,
Alexandre Proutiere
Abstract:
We consider stochastic bandit problems with a continuous set of arms and where the expected reward is a continuous and unimodal function of the arm. No further assumption is made regarding the smoothness and the structure of the expected reward function. For these problems, we propose the Stochastic Pentachotomy (SP) algorithm, and derive finite-time upper bounds on its regret and optimization error. In particular, we show that, for any expected reward function $μ$ that behaves as $μ(x)=μ(x^\star)-C|x-x^\star|^ξ$ locally around its maximizer $x^\star$ for some $ξ, C>0$, the SP algorithm is order-optimal. Namely its regret and optimization error scale as $O(\sqrt{T\log(T)})$ and $O(\sqrt{\log(T)/T})$, respectively, when the time horizon $T$ grows large. These scalings are achieved without the knowledge of $ξ$ and $C$. Our algorithm is based on asymptotically optimal sequential statistical tests used to successively trim an interval that contains the best arm with high probability. To our knowledge, the SP algorithm constitutes the first sequential arm selection rule that achieves a regret and optimization error scaling as $O(\sqrt{T})$ and $O(1/\sqrt{T})$, respectively, up to a logarithmic factor for non-smooth expected reward functions, as well as for smooth functions with unknown smoothness.
Submitted 6 March, 2015; v1 submitted 28 June, 2014;
originally announced June 2014.