-
Improved Regret Bounds for Linear Bandits with Heavy-Tailed Rewards
Authors:
Artin Tajdini,
Jonathan Scarlett,
Kevin Jamieson
Abstract:
We study stochastic linear bandits with heavy-tailed rewards, where the rewards have a finite $(1+ε)$-absolute central moment bounded by $\upsilon$ for some $ε\in (0,1]$. We improve both upper and lower bounds on the minimax regret compared to prior work. When $\upsilon = \mathcal{O}(1)$, the best prior known regret upper bound is $\tilde{\mathcal{O}}(d T^{\frac{1}{1+ε}})$. While a lower with the…
▽ More
We study stochastic linear bandits with heavy-tailed rewards, where the rewards have a finite $(1+ε)$-absolute central moment bounded by $\upsilon$ for some $ε\in (0,1]$. We improve both upper and lower bounds on the minimax regret compared to prior work. When $\upsilon = \mathcal{O}(1)$, the best prior known regret upper bound is $\tilde{\mathcal{O}}(d T^{\frac{1}{1+ε}})$. While a lower with the same scaling has been given, it relies on a construction using $\upsilon = \mathcal{O}(d)$, and adapting the construction to the bounded-moment regime with $\upsilon = \mathcal{O}(1)$ yields only a $Ω(d^{\fracε{1+ε}} T^{\frac{1}{1+ε}})$ lower bound. This matches the known rate for multi-armed bandits and is generally loose for linear bandits, in particular being $\sqrt{d}$ below the optimal rate in the finite-variance case ($ε= 1$). We propose a new elimination-based algorithm guided by experimental design, which achieves regret $\tilde{\mathcal{O}}(d^{\frac{1+3ε}{2(1+ε)}} T^{\frac{1}{1+ε}})$, thus improving the dependence on $d$ for all $ε\in (0,1)$ and recovering a known optimal result for $ε= 1$. We also establish a lower bound of $Ω(d^{\frac{2ε}{1+ε}} T^{\frac{1}{1+ε}})$, which strictly improves upon the multi-armed bandit rate and highlights the hardness of heavy-tailed linear bandit problems. For finite action sets, we derive similarly improved upper and lower bounds for regret. Finally, we provide action set dependent regret upper bounds showing that for some geometries, such as $l_p$-norm balls for $p \le 1 + ε$, we can further reduce the dependence on $d$, and we can handle infinite-dimensional settings via the kernel trick, in particular establishing new regret bounds for the Matérn kernel that are the first to be sublinear for all $ε\in (0, 1]$.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
Active learning of neural population dynamics using two-photon holographic optogenetics
Authors:
Andrew Wagenmaker,
Lu Mi,
Marton Rozsa,
Matthew S. Bull,
Karel Svoboda,
Kayvon Daie,
Matthew D. Golub,
Kevin Jamieson
Abstract:
Recent advances in techniques for monitoring and perturbing neural populations have greatly enhanced our ability to study circuits in the brain. In particular, two-photon holographic optogenetics now enables precise photostimulation of experimenter-specified groups of individual neurons, while simultaneous two-photon calcium imaging enables the measurement of ongoing and induced activity across th…
▽ More
Recent advances in techniques for monitoring and perturbing neural populations have greatly enhanced our ability to study circuits in the brain. In particular, two-photon holographic optogenetics now enables precise photostimulation of experimenter-specified groups of individual neurons, while simultaneous two-photon calcium imaging enables the measurement of ongoing and induced activity across the neural population. Despite the enormous space of potential photostimulation patterns and the time-consuming nature of photostimulation experiments, very little algorithmic work has been done to determine the most effective photostimulation patterns for identifying the neural population dynamics. Here, we develop methods to efficiently select which neurons to stimulate such that the resulting neural responses will best inform a dynamical model of the neural population activity. Using neural population responses to photostimulation in mouse motor cortex, we demonstrate the efficacy of a low-rank linear dynamical systems model, and develop an active learning procedure which takes advantage of low-rank structure to determine informative photostimulation patterns. We demonstrate our approach on both real and synthetic data, obtaining in some cases as much as a two-fold reduction in the amount of data required to reach a given predictive power. Our active stimulation design method is based on a novel active learning procedure for low-rank regression, which may be of independent interest.
△ Less
Submitted 8 May, 2025; v1 submitted 3 December, 2024;
originally announced December 2024.
-
Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL
Authors:
Andrew Wagenmaker,
Kevin Huang,
Liyiming Ke,
Byron Boots,
Kevin Jamieson,
Abhishek Gupta
Abstract:
In order to mitigate the sample complexity of real-world reinforcement learning, common practice is to first train a policy in a simulator where samples are cheap, and then deploy this policy in the real world, with the hope that it generalizes effectively. Such \emph{direct sim2real} transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the…
▽ More
In order to mitigate the sample complexity of real-world reinforcement learning, common practice is to first train a policy in a simulator where samples are cheap, and then deploy this policy in the real world, with the hope that it generalizes effectively. Such \emph{direct sim2real} transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of \emph{exploratory} policies which enable efficient exploration in the real world. In particular, in the setting of low-rank MDPs, we show that coupling these exploratory policies with simple, practical approaches -- least-squares regression oracles and naive randomized exploration -- yields a polynomial sample complexity in the real world, an exponential improvement over direct sim2real transfer, or learning without access to a simulator. To the best of our knowledge, this is the first evidence that simulation transfer yields a provable gain in reinforcement learning in settings where direct sim2real transfer fails. We validate our theoretical results on several realistic robotic simulators and a real-world robotic sim2real task, demonstrating that transferring exploratory policies can yield substantial gains in practice as well.
△ Less
Submitted 26 October, 2024;
originally announced October 2024.
-
Fair Active Learning in Low-Data Regimes
Authors:
Romain Camilleri,
Andrew Wagenmaker,
Jamie Morgenstern,
Lalit Jain,
Kevin Jamieson
Abstract:
In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize marginal accuracy gains of small…
▽ More
In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize marginal accuracy gains of small amounts of labeled data. However, existing applications of active learning for fairness fail to deliver on this, typically requiring large labeled datasets, or failing to ensure the desired fairness tolerance is met on the population distribution.
To address such limitations, we introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine. We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability. We evaluate our proposed approach using well-established real-world benchmark datasets and compare it against state-of-the-art methods, demonstrating its effectiveness in producing fair models, and improvement over existing methods.
△ Less
Submitted 13 December, 2023;
originally announced December 2023.
-
Nearly Minimax Optimal Submodular Maximization with Bandit Feedback
Authors:
Artin Tajdini,
Lalit Jain,
Kevin Jamieson
Abstract:
We consider maximizing an unknown monotonic, submodular set function $f: 2^{[n]} \rightarrow [0,1]$ with cardinality constraint under stochastic bandit feedback. At each time $t=1,\dots,T$ the learner chooses a set $S_t \subset [n]$ with $|S_t| \leq k$ and receives reward $f(S_t) + η_t$ where $η_t$ is mean-zero sub-Gaussian noise. The objective is to minimize the learner's regret with respect to a…
▽ More
We consider maximizing an unknown monotonic, submodular set function $f: 2^{[n]} \rightarrow [0,1]$ with cardinality constraint under stochastic bandit feedback. At each time $t=1,\dots,T$ the learner chooses a set $S_t \subset [n]$ with $|S_t| \leq k$ and receives reward $f(S_t) + η_t$ where $η_t$ is mean-zero sub-Gaussian noise. The objective is to minimize the learner's regret with respect to an approximation of the maximum $f(S_*)$ with $|S_*| = k$, obtained through robust greedy maximization of $f$. To date, the best regret bound in the literature scales as $k n^{1/3} T^{2/3}$. And by trivially treating every set as a unique arm one deduces that $\sqrt{ {n \choose k} T }$ is also achievable using standard multi-armed bandit algorithms. In this work, we establish the first minimax lower bound for this setting that scales like $\tildeΩ(\min_{L \le k}(L^{1/3}n^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$. For a slightly restricted algorithm class, we prove a stronger regret lower bound of $\tildeΩ(\min_{L \le k}(Ln^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$. Moreover, we propose an algorithm Sub-UCB that achieves regret $\tilde{\mathcal{O}}(\min_{L \le k}(Ln^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$ capable of matching the lower bound on regret for the restricted class up to logarithmic factors.
△ Less
Submitted 12 December, 2024; v1 submitted 27 October, 2023;
originally announced October 2023.
-
Optimal Exploration is no harder than Thompson Sampling
Authors:
Zhaoqi Li,
Kevin Jamieson,
Lalit Jain
Abstract:
Given a set of arms $\mathcal{Z}\subset \mathbb{R}^d$ and an unknown parameter vector $θ_\ast\in\mathbb{R}^d$, the pure exploration linear bandit problem aims to return $\arg\max_{z\in \mathcal{Z}} z^{\top}θ_{\ast}$, with high probability through noisy measurements of $x^{\top}θ_{\ast}$ with $x\in \mathcal{X}\subset \mathbb{R}^d$. Existing (asymptotically) optimal methods require either a) potenti…
▽ More
Given a set of arms $\mathcal{Z}\subset \mathbb{R}^d$ and an unknown parameter vector $θ_\ast\in\mathbb{R}^d$, the pure exploration linear bandit problem aims to return $\arg\max_{z\in \mathcal{Z}} z^{\top}θ_{\ast}$, with high probability through noisy measurements of $x^{\top}θ_{\ast}$ with $x\in \mathcal{X}\subset \mathbb{R}^d$. Existing (asymptotically) optimal methods require either a) potentially costly projections for each arm $z\in \mathcal{Z}$ or b) explicitly maintaining a subset of $\mathcal{Z}$ under consideration at each time. This complexity is at odds with the popular and simple Thompson Sampling algorithm for regret minimization, which just requires access to a posterior sampling and argmax oracle, and does not need to enumerate $\mathcal{Z}$ at any point. Unfortunately, Thompson sampling is known to be sub-optimal for pure exploration. In this work, we pose a natural question: is there an algorithm that can explore optimally and only needs the same computational primitives as Thompson Sampling? We answer the question in the affirmative. We provide an algorithm that leverages only sampling and argmax oracles and achieves an exponential convergence rate, with the exponent being the optimal among all possible allocations asymptotically. In addition, we show that our algorithm can be easily implemented and performs as well empirically as existing asymptotically optimal methods.
△ Less
Submitted 24 October, 2023; v1 submitted 9 October, 2023;
originally announced October 2023.
-
A/B Testing and Best-arm Identification for Linear Bandits with Robustness to Non-stationarity
Authors:
Zhihan Xiong,
Romain Camilleri,
Maryam Fazel,
Lalit Jain,
Kevin Jamieson
Abstract:
We investigate the fixed-budget best-arm identification (BAI) problem for linear bandits in a potentially non-stationary environment. Given a finite arm set $\mathcal{X}\subset\mathbb{R}^d$, a fixed budget $T$, and an unpredictable sequence of parameters $\left\lbraceθ_t\right\rbrace_{t=1}^{T}$, an algorithm will aim to correctly identify the best arm…
▽ More
We investigate the fixed-budget best-arm identification (BAI) problem for linear bandits in a potentially non-stationary environment. Given a finite arm set $\mathcal{X}\subset\mathbb{R}^d$, a fixed budget $T$, and an unpredictable sequence of parameters $\left\lbraceθ_t\right\rbrace_{t=1}^{T}$, an algorithm will aim to correctly identify the best arm $x^* := \arg\max_{x\in\mathcal{X}}x^\top\sum_{t=1}^{T}θ_t$ with probability as high as possible. Prior work has addressed the stationary setting where $θ_t = θ_1$ for all $t$ and demonstrated that the error probability decreases as $\exp(-T /ρ^*)$ for a problem-dependent constant $ρ^*$. But in many real-world $A/B/n$ multivariate testing scenarios that motivate our work, the environment is non-stationary and an algorithm expecting a stationary setting can easily fail. For robust identification, it is well-known that if arms are chosen randomly and non-adaptively from a G-optimal design over $\mathcal{X}$ at each time then the error probability decreases as $\exp(-TΔ^2_{(1)}/d)$, where $Δ_{(1)} = \min_{x \neq x^*} (x^* - x)^\top \frac{1}{T}\sum_{t=1}^T θ_t$. As there exist environments where $Δ_{(1)}^2/ d \ll 1/ ρ^*$, we are motivated to propose a novel algorithm $\mathsf{P1}$-$\mathsf{RAGE}$ that aims to obtain the best of both worlds: robustness to non-stationarity and fast rates of identification in benign settings. We characterize the error probability of $\mathsf{P1}$-$\mathsf{RAGE}$ and demonstrate empirically that the algorithm indeed never performs worse than G-optimal design but compares favorably to the best algorithms in the stationary setting.
△ Less
Submitted 15 February, 2024; v1 submitted 27 July, 2023;
originally announced July 2023.
-
Optimal Exploration for Model-Based RL in Nonlinear Systems
Authors:
Andrew Wagenmaker,
Guanya Shi,
Kevin Jamieson
Abstract:
Learning to control unknown nonlinear dynamical systems is a fundamental problem in reinforcement learning and control theory. A commonly applied approach is to first explore the environment (exploration), learn an accurate model of it (system identification), and then compute an optimal controller with the minimum cost on this estimated system (policy optimization). While existing work has shown…
▽ More
Learning to control unknown nonlinear dynamical systems is a fundamental problem in reinforcement learning and control theory. A commonly applied approach is to first explore the environment (exploration), learn an accurate model of it (system identification), and then compute an optimal controller with the minimum cost on this estimated system (policy optimization). While existing work has shown that it is possible to learn a uniformly good model of the system~\citep{mania2020active}, in practice, if we aim to learn a good controller with a low cost on the actual system, certain system parameters may be significantly more critical than others, and we therefore ought to focus our exploration on learning such parameters.
In this work, we consider the setting of nonlinear dynamical systems and seek to formally quantify, in such settings, (a) which parameters are most relevant to learning a good controller, and (b) how we can best explore so as to minimize uncertainty in such parameters. Inspired by recent work in linear systems~\citep{wagenmaker2021task}, we show that minimizing the controller loss in nonlinear systems translates to estimating the system parameters in a particular, task-dependent metric. Motivated by this, we develop an algorithm able to efficiently explore the system to reduce uncertainty in this metric, and prove a lower bound showing that our approach learns a controller at a near-instance-optimal rate. Our algorithm relies on a general reduction from policy optimization to optimal experiment design in arbitrary systems, and may be of independent interest. We conclude with experiments demonstrating the effectiveness of our method in realistic nonlinear robotic systems.
△ Less
Submitted 15 June, 2023;
originally announced June 2023.
-
Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design
Authors:
Andrew Wagenmaker,
Kevin Jamieson
Abstract:
While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this wo…
▽ More
While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
△ Less
Submitted 20 July, 2023; v1 submitted 6 July, 2022;
originally announced July 2022.
-
Instance-optimal PAC Algorithms for Contextual Bandits
Authors:
Zhaoqi Li,
Lillian Ratliff,
Houssam Nassif,
Kevin Jamieson,
Lalit Jain
Abstract:
In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(ε,δ)$-$\textit{PAC}$ setting: given a policy class $Π$ the goal of the learner is to return a policy $π\in Π$ whose expected reward is wi…
▽ More
In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(ε,δ)$-$\textit{PAC}$ setting: given a policy class $Π$ the goal of the learner is to return a policy $π\in Π$ whose expected reward is within $ε$ of the optimal policy with probability greater than $1-δ$. We characterize the first $\textit{instance-dependent}$ PAC sample complexity of contextual bandits through a quantity $ρ_Π$, and provide matching upper and lower bounds in terms of $ρ_Π$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
△ Less
Submitted 5 July, 2022;
originally announced July 2022.
-
Active Learning with Safety Constraints
Authors:
Romain Camilleri,
Andrew Wagenmaker,
Jamie Morgenstern,
Lalit Jain,
Kevin Jamieson
Abstract:
Active learning methods have shown great promise in reducing the number of samples necessary for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work we investigate the complexity of learning the best safe decision in interactive environments. We red…
▽ More
Active learning methods have shown great promise in reducing the number of samples necessary for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work we investigate the complexity of learning the best safe decision in interactive environments. We reduce this problem to a constrained linear bandits problem, where our goal is to find the best arm satisfying certain (unknown) safety constraints. We propose an adaptive experimental design-based algorithm, which we show efficiently trades off between the difficulty of showing an arm is unsafe vs suboptimal. To our knowledge, our results are the first on best-arm identification in linear bandits with safety constraints. In practice, we demonstrate that this approach performs well on synthetic and real world datasets.
△ Less
Submitted 22 June, 2022;
originally announced June 2022.
-
Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes
Authors:
Andrew Wagenmaker,
Yifang Chen,
Max Simchowitz,
Simon S. Du,
Kevin Jamieson
Abstract:
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the the tabular setting, it is well known that this is a more difficult problem than reward-aware (PAC) RL -- where the agent has access to the reward fun…
▽ More
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the the tabular setting, it is well known that this is a more difficult problem than reward-aware (PAC) RL -- where the agent has access to the reward function during exploration -- with optimal sample complexities in the two settings differing by a factor of $|\mathcal{S}|$, the size of the state space. We show that this separation does not exist in the setting of linear MDPs. We first develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP with sample complexity scaling as $\widetilde{\mathcal{O}}(d^2 H^5/ε^2)$. We then show a lower bound with matching dimension-dependence of $Ω(d^2 H^2/ε^2)$, which holds for the reward-aware RL setting. To our knowledge, our approach is the first computationally efficient algorithm to achieve optimal $d$ dependence in linear MDPs, even in the single-reward PAC setting. Our algorithm relies on a novel procedure which efficiently traverses a linear MDP, collecting samples in any given ``feature direction'', and enjoys a sample complexity scaling optimally in the (linear MDP equivalent of the) maximal state visitation probability. We show that this exploration procedure can also be applied to solve the problem of obtaining ``well-conditioned'' covariates in linear MDPs.
△ Less
Submitted 18 June, 2022; v1 submitted 26 January, 2022;
originally announced January 2022.
-
First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach
Authors:
Andrew Wagenmaker,
Yifang Chen,
Max Simchowitz,
Simon S. Du,
Kevin Jamieson
Abstract:
Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible…
▽ More
Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as $\widetilde{\mathcal{O}}(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K} + d^{3.5}H^3\log K )$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
△ Less
Submitted 20 October, 2022; v1 submitted 6 December, 2021;
originally announced December 2021.
-
Best Arm Identification with Safety Constraints
Authors:
Zhenlin Wang,
Andrew Wagenmaker,
Kevin Jamieson
Abstract:
The best arm identification problem in the multi-armed bandit setting is an excellent model of many real-world decision-making problems, yet it fails to capture the fact that in the real-world, safety constraints often must be met while learning. In this work we study the question of best-arm identification in safety-critical settings, where the goal of the agent is to find the best safe option ou…
▽ More
The best arm identification problem in the multi-armed bandit setting is an excellent model of many real-world decision-making problems, yet it fails to capture the fact that in the real-world, safety constraints often must be met while learning. In this work we study the question of best-arm identification in safety-critical settings, where the goal of the agent is to find the best safe option out of many, while exploring in a way that guarantees certain, initially unknown safety constraints are met. We first analyze this problem in the setting where the reward and safety constraint takes a linear structure, and show nearly matching upper and lower bounds. We then analyze a much more general version of the problem where we only assume the reward and safety constraint can be modeled by monotonic functions, and propose an algorithm in this setting which is guaranteed to learn safely. We conclude with experimental results demonstrating the effectiveness of our approaches in scenarios such as safely identifying the best drug out of many in order to treat an illness.
△ Less
Submitted 23 November, 2021;
originally announced November 2021.
-
Practical, Provably-Correct Interactive Learning in the Realizable Setting: The Power of True Believers
Authors:
Julian Katz-Samuels,
Blake Mason,
Kevin Jamieson,
Rob Nowak
Abstract:
We consider interactive learning in the realizable setting and develop a general framework to handle problems ranging from best arm identification to active classification. We begin our investigation with the observation that agnostic algorithms \emph{cannot} be minimax-optimal in the realizable setting. Hence, we design novel computationally efficient algorithms for the realizable setting that ma…
▽ More
We consider interactive learning in the realizable setting and develop a general framework to handle problems ranging from best arm identification to active classification. We begin our investigation with the observation that agnostic algorithms \emph{cannot} be minimax-optimal in the realizable setting. Hence, we design novel computationally efficient algorithms for the realizable setting that match the minimax lower bound up to logarithmic factors and are general-purpose, accommodating a wide variety of function classes including kernel methods, H{ö}lder smooth functions, and convex functions. The sample complexities of our algorithms can be quantified in terms of well-known quantities like the extended teaching dimension and haystack dimension. However, unlike algorithms based directly on those combinatorial quantities, our algorithms are computationally efficient. To achieve computational efficiency, our algorithms sample from the version space using Monte Carlo "hit-and-run" algorithms instead of maintaining the version space explicitly. Our approach has two key strengths. First, it is simple, consisting of two unifying, greedy algorithms. Second, our algorithms have the capability to seamlessly leverage prior knowledge that is often available and useful in practice. In addition to our new theoretical results, we demonstrate empirically that our algorithms are competitive with Gaussian process UCB methods.
△ Less
Submitted 8 November, 2021;
originally announced November 2021.
-
Nearly Optimal Algorithms for Level Set Estimation
Authors:
Blake Mason,
Romain Camilleri,
Subhojyoti Mukherjee,
Kevin Jamieson,
Robert Nowak,
Lalit Jain
Abstract:
The level set estimation problem seeks to find all points in a domain ${\cal X}$ where the value of an unknown function $f:{\cal X}\rightarrow \mathbb{R}$ exceeds a threshold $α$. The estimation is based on noisy function evaluations that may be acquired at sequentially and adaptively chosen locations in ${\cal X}$. The threshold value $α$ can either be \emph{explicit} and provided a priori, or \e…
▽ More
The level set estimation problem seeks to find all points in a domain ${\cal X}$ where the value of an unknown function $f:{\cal X}\rightarrow \mathbb{R}$ exceeds a threshold $α$. The estimation is based on noisy function evaluations that may be acquired at sequentially and adaptively chosen locations in ${\cal X}$. The threshold value $α$ can either be \emph{explicit} and provided a priori, or \emph{implicit} and defined relative to the optimal function value, i.e. $α= (1-ε)f(x_\ast)$ for a given $ε> 0$ where $f(x_\ast)$ is the maximal function value and is unknown. In this work we provide a new approach to the level set estimation problem by relating it to recent adaptive experimental design methods for linear bandits in the Reproducing Kernel Hilbert Space (RKHS) setting. We assume that $f$ can be approximated by a function in the RKHS up to an unknown misspecification and provide novel algorithms for both the implicit and explicit cases in this setting with strong theoretical guarantees. Moreover, in the linear (kernel) setting, we show that our bounds are nearly optimal, namely, our upper bounds match existing lower bounds for threshold linear bandits. To our knowledge this work provides the first instance-dependent, non-asymptotic upper bounds on sample complexity of level-set estimation that match information theoretic lower bounds.
△ Less
Submitted 2 November, 2021;
originally announced November 2021.
-
Beyond No Regret: Instance-Dependent PAC Reinforcement Learning
Authors:
Andrew Wagenmaker,
Max Simchowitz,
Kevin Jamieson
Abstract:
The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $ε$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $ε$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this…
▽ More
The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $ε$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $ε$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $ε$-optimal policy at the instance-optimal rate.
Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
△ Less
Submitted 21 June, 2022; v1 submitted 5 August, 2021;
originally announced August 2021.
-
Corruption Robust Active Learning
Authors:
Yifang Chen,
Simon S. Du,
Kevin Jamieson
Abstract:
We conduct theoretical studies on streaming-based active learning for binary classification under unknown adversarial label corruptions. In this setting, every time before the learner observes a sample, the adversary decides whether to corrupt the label or not. First, we show that, in a benign corruption setting (which includes the misspecification setting as a special case), with a slight enlarge…
▽ More
We conduct theoretical studies on streaming-based active learning for binary classification under unknown adversarial label corruptions. In this setting, every time before the learner observes a sample, the adversary decides whether to corrupt the label or not. First, we show that, in a benign corruption setting (which includes the misspecification setting as a special case), with a slight enlargement on the hypothesis elimination threshold, the classical RobustCAL framework can (surprisingly) achieve nearly the same label complexity guarantee as in the non-corrupted setting. However, this algorithm can fail in the general corruption setting. To resolve this drawback, we propose a new algorithm which is provably correct without any assumptions on the presence of corruptions. Furthermore, this algorithm enjoys the minimax label complexity in the non-corrupted setting (which is achieved by RobustCAL) and only requires $\tilde{\mathcal{O}}(C_{\mathrm{total}})$ additional labels in the corrupted setting to achieve $\mathcal{O}(\varepsilon + \frac{C_{\mathrm{total}}}{n})$, where $\varepsilon$ is the target accuracy, $C_{\mathrm{total}}$ is the total number of corruptions and $n$ is the total number of unlabeled samples.
△ Less
Submitted 21 June, 2021;
originally announced June 2021.
-
Improved Algorithms for Agnostic Pool-based Active Classification
Authors:
Julian Katz-Samuels,
Jifan Zhang,
Lalit Jain,
Kevin Jamieson
Abstract:
We consider active learning for binary classification in the agnostic pool-based setting. The vast majority of works in active learning in the agnostic setting are inspired by the CAL algorithm where each query is uniformly sampled from the disagreement region of the current version space. The sample complexity of such algorithms is described by a quantity known as the disagreement coefficient whi…
▽ More
We consider active learning for binary classification in the agnostic pool-based setting. The vast majority of works in active learning in the agnostic setting are inspired by the CAL algorithm where each query is uniformly sampled from the disagreement region of the current version space. The sample complexity of such algorithms is described by a quantity known as the disagreement coefficient which captures both the geometry of the hypothesis space as well as the underlying probability space. To date, the disagreement coefficient has been justified by minimax lower bounds only, leaving the door open for superior instance dependent sample complexities. In this work we propose an algorithm that, in contrast to uniform sampling over the disagreement region, solves an experimental design problem to determine a distribution over examples from which to request labels. We show that the new approach achieves sample complexity bounds that are never worse than the best disagreement coefficient-based bounds, but in specific cases can be dramatically smaller. From a practical perspective, the proposed algorithm requires no hyperparameters to tune (e.g., to control the aggressiveness of sampling), and is computationally efficient by means of assuming access to an empirical risk minimization oracle (without any constraints). Empirically, we demonstrate that our algorithm is superior to state of the art agnostic active learning algorithms on image classification datasets.
△ Less
Submitted 13 May, 2021;
originally announced May 2021.
-
Task-Optimal Exploration in Linear Dynamical Systems
Authors:
Andrew Wagenmaker,
Max Simchowitz,
Kevin Jamieson
Abstract:
Exploration in unknown environments is a fundamental problem in reinforcement learning and control. In this work, we study task-guided exploration and determine what precisely an agent must learn about their environment in order to complete a particular task. Formally, we study a broad class of decision-making problems in the setting of linear dynamical systems, a class that includes the linear qu…
▽ More
Exploration in unknown environments is a fundamental problem in reinforcement learning and control. In this work, we study task-guided exploration and determine what precisely an agent must learn about their environment in order to complete a particular task. Formally, we study a broad class of decision-making problems in the setting of linear dynamical systems, a class that includes the linear quadratic regulator problem. We provide instance- and task-dependent lower bounds which explicitly quantify the difficulty of completing a task of interest. Motivated by our lower bound, we propose a computationally efficient experiment-design based exploration algorithm. We show that it optimally explores the environment, collecting precisely the information needed to complete the task, and provide finite-time bounds guaranteeing that it achieves the instance- and task-optimal sample complexity, up to constant factors. Through several examples of the LQR problem, we show that performing task-guided exploration provably improves on exploration schemes which do not take into account the task of interest. Along the way, we establish that certainty equivalence decision making is instance- and task-optimal, and obtain the first algorithm for the linear quadratic regulator problem which is instance-optimal. We conclude with several experiments illustrating the effectiveness of our approach in practice.
△ Less
Submitted 9 July, 2021; v1 submitted 9 February, 2021;
originally announced February 2021.
-
Experimental Design for Regret Minimization in Linear Bandits
Authors:
Andrew Wagenmaker,
Julian Katz-Samuels,
Kevin Jamieson
Abstract:
In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms--which have been shown to be suboptimal in many cases--our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the fail…
▽ More
In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms--which have been shown to be suboptimal in many cases--our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regime. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds.
△ Less
Submitted 26 February, 2021; v1 submitted 1 November, 2020;
originally announced November 2020.
-
A New Perspective on Pool-Based Active Classification and False-Discovery Control
Authors:
Lalit Jain,
Kevin Jamieson
Abstract:
In many scientific settings there is a need for adaptive experimental design to guide the process of identifying regions of the search space that contain as many true positives as possible subject to a low rate of false discoveries (i.e. false alarms). Such regions of the search space could differ drastically from a predicted set that minimizes 0/1 error and accurate identification could require v…
▽ More
In many scientific settings there is a need for adaptive experimental design to guide the process of identifying regions of the search space that contain as many true positives as possible subject to a low rate of false discoveries (i.e. false alarms). Such regions of the search space could differ drastically from a predicted set that minimizes 0/1 error and accurate identification could require very different sampling strategies. Like active learning for binary classification, this experimental design cannot be optimally chosen a priori, but rather the data must be taken sequentially and adaptively. However, unlike classification with 0/1 error, collecting data adaptively to find a set with high true positive rate and low false discovery rate (FDR) is not as well understood. In this paper we provide the first provably sample efficient adaptive algorithm for this problem. Along the way we highlight connections between classification, combinatorial bandits, and FDR control making contributions to each.
△ Less
Submitted 14 August, 2020;
originally announced August 2020.
-
An Empirical Process Approach to the Union Bound: Practical Algorithms for Combinatorial and Linear Bandits
Authors:
Julian Katz-Samuels,
Lalit Jain,
Zohar Karnin,
Kevin Jamieson
Abstract:
This paper proposes near-optimal algorithms for the pure-exploration linear bandit problem in the fixed confidence and fixed budget settings. Leveraging ideas from the theory of suprema of empirical processes, we provide an algorithm whose sample complexity scales with the geometry of the instance and avoids an explicit union bound over the number of arms. Unlike previous approaches which sample b…
▽ More
This paper proposes near-optimal algorithms for the pure-exploration linear bandit problem in the fixed confidence and fixed budget settings. Leveraging ideas from the theory of suprema of empirical processes, we provide an algorithm whose sample complexity scales with the geometry of the instance and avoids an explicit union bound over the number of arms. Unlike previous approaches which sample based on minimizing a worst-case variance (e.g. G-optimal design), we define an experimental design objective based on the Gaussian-width of the underlying arm set. We provide a novel lower bound in terms of this objective that highlights its fundamental role in the sample complexity. The sample complexity of our fixed confidence algorithm matches this lower bound, and in addition is computationally efficient for combinatorial classes, e.g. shortest-path, matchings and matroids, where the arm sets can be exponentially large in the dimension. Finally, we propose the first algorithm for linear bandits in the the fixed budget setting. Its guarantee matches our lower bound up to logarithmic factors.
△ Less
Submitted 20 June, 2020;
originally announced June 2020.
-
Estimating the number and effect sizes of non-null hypotheses
Authors:
Jennifer Brennan,
Ramya Korlakai Vinayak,
Kevin Jamieson
Abstract:
We study the problem of estimating the distribution of effect sizes (the mean of the test statistic under the alternate hypothesis) in a multiple testing setting. Knowing this distribution allows us to calculate the power (type II error) of any experimental design. We show that it is possible to estimate this distribution using an inexpensive pilot experiment, which takes significantly fewer sampl…
▽ More
We study the problem of estimating the distribution of effect sizes (the mean of the test statistic under the alternate hypothesis) in a multiple testing setting. Knowing this distribution allows us to calculate the power (type II error) of any experimental design. We show that it is possible to estimate this distribution using an inexpensive pilot experiment, which takes significantly fewer samples than would be required by an experiment that identified the discoveries. Our estimator can be used to guarantee the number of discoveries that will be made using a given experimental design in a future experiment. We prove that this simple and computationally efficient estimator enjoys a number of favorable theoretical properties, and demonstrate its effectiveness on data from a gene knockout experiment on influenza inhibition in Drosophila.
△ Less
Submitted 24 July, 2020; v1 submitted 17 February, 2020;
originally announced February 2020.
-
Active Learning for Identification of Linear Dynamical Systems
Authors:
Andrew Wagenmaker,
Kevin Jamieson
Abstract:
We propose an algorithm to actively estimate the parameters of a linear dynamical system. Given complete control over the system's input, our algorithm adaptively chooses the inputs to accelerate estimation. We show a finite time bound quantifying the estimation rate our algorithm attains and prove matching upper and lower bounds which guarantee its asymptotic optimality, up to constants. In addit…
▽ More
We propose an algorithm to actively estimate the parameters of a linear dynamical system. Given complete control over the system's input, our algorithm adaptively chooses the inputs to accelerate estimation. We show a finite time bound quantifying the estimation rate our algorithm attains and prove matching upper and lower bounds which guarantee its asymptotic optimality, up to constants. In addition, we show that this optimal rate is unattainable when using Gaussian noise to excite the system, even with optimally tuned covariance, and analyze several examples where our algorithm provably improves over rates obtained by playing noise. Our analysis critically relies on a novel result quantifying the error in estimating the parameters of a dynamical system when arbitrary periodic inputs are being played. We conclude with numerical examples that illustrate the effectiveness of our algorithm in practice.
△ Less
Submitted 22 June, 2020; v1 submitted 2 February, 2020;
originally announced February 2020.
-
Sequential Experimental Design for Transductive Linear Bandits
Authors:
Tanner Fiez,
Lalit Jain,
Kevin Jamieson,
Lillian Ratliff
Abstract:
In this paper we introduce the transductive linear bandit problem: given a set of measurement vectors $\mathcal{X}\subset \mathbb{R}^d$, a set of items $\mathcal{Z}\subset \mathbb{R}^d$, a fixed confidence $δ$, and an unknown vector $θ^{\ast}\in \mathbb{R}^d$, the goal is to infer $\text{argmax}_{z\in \mathcal{Z}} z^\topθ^\ast$ with probability $1-δ$ by making as few sequentially chosen noisy meas…
▽ More
In this paper we introduce the transductive linear bandit problem: given a set of measurement vectors $\mathcal{X}\subset \mathbb{R}^d$, a set of items $\mathcal{Z}\subset \mathbb{R}^d$, a fixed confidence $δ$, and an unknown vector $θ^{\ast}\in \mathbb{R}^d$, the goal is to infer $\text{argmax}_{z\in \mathcal{Z}} z^\topθ^\ast$ with probability $1-δ$ by making as few sequentially chosen noisy measurements of the form $x^\topθ^{\ast}$ as possible. When $\mathcal{X}=\mathcal{Z}$, this setting generalizes linear bandits, and when $\mathcal{X}$ is the standard basis vectors and $\mathcal{Z}\subset \{0,1\}^d$, combinatorial bandits. Such a transductive setting naturally arises when the set of measurement vectors is limited due to factors such as availability or cost. As an example, in drug discovery the compounds and dosages $\mathcal{X}$ a practitioner may be willing to evaluate in the lab in vitro due to cost or safety reasons may differ vastly from those compounds and dosages $\mathcal{Z}$ that can be safely administered to patients in vivo. Alternatively, in recommender systems for books, the set of books $\mathcal{X}$ a user is queried about may be restricted to well known best-sellers even though the goal might be to recommend more esoteric titles $\mathcal{Z}$. In this paper, we provide instance-dependent lower bounds for the transductive setting, an algorithm that matches these up to logarithmic factors, and an evaluation. In particular, we provide the first non-asymptotic algorithm for linear bandits that nearly achieves the information theoretic lower bound.
△ Less
Submitted 19 June, 2019;
originally announced June 2019.
-
The True Sample Complexity of Identifying Good Arms
Authors:
Julian Katz-Samuels,
Kevin Jamieson
Abstract:
We consider two multi-armed bandit problems with $n$ arms: (i) given an $ε> 0$, identify an arm with mean that is within $ε$ of the largest mean and (ii) given a threshold $μ_0$ and integer $k$, identify $k$ arms with means larger than $μ_0$. Existing lower bounds and algorithms for the PAC framework suggest that both of these problems require $Ω(n)$ samples. However, we argue that these definitio…
▽ More
We consider two multi-armed bandit problems with $n$ arms: (i) given an $ε> 0$, identify an arm with mean that is within $ε$ of the largest mean and (ii) given a threshold $μ_0$ and integer $k$, identify $k$ arms with means larger than $μ_0$. Existing lower bounds and algorithms for the PAC framework suggest that both of these problems require $Ω(n)$ samples. However, we argue that these definitions not only conflict with how these algorithms are used in practice, but also that these results disagree with intuition that says (i) requires only $Θ(\frac{n}{m})$ samples where $m = |\{ i : μ_i > \max_{i \in [n]} μ_i - ε\}|$ and (ii) requires $Θ(\frac{n}{m}k)$ samples where $m = |\{ i : μ_i > μ_0 \}|$. We provide definitions that formalize these intuitions, obtain lower bounds that match the above sample complexities, and develop explicit, practical algorithms that achieve nearly matching upper bounds.
△ Less
Submitted 15 June, 2019;
originally announced June 2019.
-
Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs
Authors:
Max Simchowitz,
Kevin Jamieson
Abstract:
This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not suffer a dependence on diameter-like quantities or ergodicity, and smoothly interpolate between the gap dependent logarithmic-regret, and the $\widetilde{\mathcal{O}}(\sqrt{HSAT})$-minimax rate. The key technique in our analysi…
▽ More
This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not suffer a dependence on diameter-like quantities or ergodicity, and smoothly interpolate between the gap dependent logarithmic-regret, and the $\widetilde{\mathcal{O}}(\sqrt{HSAT})$-minimax rate. The key technique in our analysis is a novel "clipped" regret decomposition which applies to a broad family of recent optimistic algorithms for episodic MDPs.
△ Less
Submitted 28 October, 2019; v1 submitted 9 May, 2019;
originally announced May 2019.
-
MLSys: The New Frontier of Machine Learning Systems
Authors:
Alexander Ratner,
Dan Alistarh,
Gustavo Alonso,
David G. Andersen,
Peter Bailis,
Sarah Bird,
Nicholas Carlini,
Bryan Catanzaro,
Jennifer Chayes,
Eric Chung,
Bill Dally,
Jeff Dean,
Inderjit S. Dhillon,
Alexandros Dimakis,
Pradeep Dubey,
Charles Elkan,
Grigori Fursin,
Gregory R. Ganger,
Lise Getoor,
Phillip B. Gibbons,
Garth A. Gibson,
Joseph E. Gonzalez,
Justin Gottschlich,
Song Han,
Kim Hazelwood
, et al. (44 additional authors not shown)
Abstract:
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…
▽ More
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
△ Less
Submitted 1 December, 2019; v1 submitted 29 March, 2019;
originally announced April 2019.
-
Exploiting Reuse in Pipeline-Aware Hyperparameter Tuning
Authors:
Liam Li,
Evan Sparks,
Kevin Jamieson,
Ameet Talwalkar
Abstract:
Hyperparameter tuning of multi-stage pipelines introduces a significant computational burden. Motivated by the observation that work can be reused across pipelines if the intermediate computations are the same, we propose a pipeline-aware approach to hyperparameter tuning. Our approach optimizes both the design and execution of pipelines to maximize reuse. We design pipelines amenable for reuse by…
▽ More
Hyperparameter tuning of multi-stage pipelines introduces a significant computational burden. Motivated by the observation that work can be reused across pipelines if the intermediate computations are the same, we propose a pipeline-aware approach to hyperparameter tuning. Our approach optimizes both the design and execution of pipelines to maximize reuse. We design pipelines amenable for reuse by (i) introducing a novel hybrid hyperparameter tuning method called gridded random search, and (ii) reducing the average training time in pipelines by adapting early-stopping hyperparameter tuning approaches. We then realize the potential for reuse during execution by introducing a novel caching problem for ML workloads which we pose as a mixed integer linear program (ILP), and subsequently evaluating various caching heuristics relative to the optimal solution of the ILP. We conduct experiments on simulated and real-world machine learning pipelines to show that a pipeline-aware approach to hyperparameter tuning can offer over an order-of-magnitude speedup over independently evaluating pipeline configurations.
△ Less
Submitted 12 March, 2019;
originally announced March 2019.
-
Pure-Exploration for Infinite-Armed Bandits with General Arm Reservoirs
Authors:
Maryam Aziz,
Kevin Jamieson,
Javed Aslam
Abstract:
This paper considers a multi-armed bandit game where the number of arms is much larger than the maximum budget and is effectively infinite. We characterize necessary and sufficient conditions on the total budget for an algorithm to return an ε-good arm with probability at least 1 - δ. In such situations, the sample complexity depends on ε, δ and the so-called reservoir distribution ν from which th…
▽ More
This paper considers a multi-armed bandit game where the number of arms is much larger than the maximum budget and is effectively infinite. We characterize necessary and sufficient conditions on the total budget for an algorithm to return an ε-good arm with probability at least 1 - δ. In such situations, the sample complexity depends on ε, δ and the so-called reservoir distribution ν from which the means of the arms are drawn iid. While a substantial literature has developed around analyzing specific cases of ν such as the beta distribution, our analysis makes no assumption about the form of ν. Our algorithm is based on successive halving with the surprising exception that arms start to be discarded after just a single pull, requiring an analysis that goes beyond concentration alone. The provable correctness of this algorithm also provides an explanation for the empirical observation that the most aggressive bracket of the Hyperband algorithm of Li et al. (2017) for hyperparameter tuning is almost always best.
△ Less
Submitted 12 January, 2019; v1 submitted 14 November, 2018;
originally announced November 2018.
-
A System for Massively Parallel Hyperparameter Tuning
Authors:
Liam Li,
Kevin Jamieson,
Afshin Rostamizadeh,
Ekaterina Gonina,
Moritz Hardt,
Benjamin Recht,
Ameet Talwalkar
Abstract:
Modern learning models are characterized by large hyperparameter spaces and long training times. These properties, coupled with the rise of parallel computing and the growing demand to productionize machine learning workloads, motivate the need to develop mature hyperparameter optimization functionality in distributed computing settings. We address this challenge by first introducing a simple and…
▽ More
Modern learning models are characterized by large hyperparameter spaces and long training times. These properties, coupled with the rise of parallel computing and the growing demand to productionize machine learning workloads, motivate the need to develop mature hyperparameter optimization functionality in distributed computing settings. We address this challenge by first introducing a simple and robust hyperparameter optimization algorithm called ASHA, which exploits parallelism and aggressive early-stopping to tackle large-scale hyperparameter optimization problems. Our extensive empirical results show that ASHA outperforms existing state-of-the-art hyperparameter optimization methods; scales linearly with the number of workers in distributed settings; and is suitable for massive parallelism, as demonstrated on a task with 500 workers. We then describe several design decisions we encountered, along with our associated solutions, when integrating ASHA in Determined AI's end-to-end production-quality machine learning system that offers hyperparameter tuning as a service.
△ Less
Submitted 15 March, 2020; v1 submitted 13 October, 2018;
originally announced October 2018.
-
A Bandit Approach to Multiple Testing with False Discovery Control
Authors:
Kevin Jamieson,
Lalit Jain
Abstract:
We propose an adaptive sampling approach for multiple testing which aims to maximize statistical power while ensuring anytime false discovery control. We consider $n$ distributions whose means are partitioned by whether they are below or equal to a baseline (nulls), versus above the baseline (actual positives). In addition, each distribution can be sequentially and repeatedly sampled. Inspired by…
▽ More
We propose an adaptive sampling approach for multiple testing which aims to maximize statistical power while ensuring anytime false discovery control. We consider $n$ distributions whose means are partitioned by whether they are below or equal to a baseline (nulls), versus above the baseline (actual positives). In addition, each distribution can be sequentially and repeatedly sampled. Inspired by the multi-armed bandit literature, we provide an algorithm that takes as few samples as possible to exceed a target true positive proportion (i.e. proportion of actual positives discovered) while giving anytime control of the false discovery proportion (nulls predicted as actual positives). Our sample complexity results match known information theoretic lower bounds and through simulations we show a substantial performance improvement over uniform sampling and an adaptive elimination style algorithm. Given the simplicity of the approach, and its sample efficiency, the method has promise for wide adoption in the biological sciences, clinical testing for drug discovery, and online A/B/n testing problems.
△ Less
Submitted 16 July, 2019; v1 submitted 6 September, 2018;
originally announced September 2018.
-
Adaptive Sampling for Convex Regression
Authors:
Max Simchowitz,
Kevin Jamieson,
Jordan W. Suchow,
Thomas L. Griffiths
Abstract:
In this paper, we introduce the first principled adaptive-sampling procedure for learning a convex function in the $L_\infty$ norm, a problem that arises often in the behavioral and social sciences. We present a function-specific measure of complexity and use it to prove that, for each convex function $f_{\star}$, our algorithm nearly attains the information-theoretically optimal, function-specifi…
▽ More
In this paper, we introduce the first principled adaptive-sampling procedure for learning a convex function in the $L_\infty$ norm, a problem that arises often in the behavioral and social sciences. We present a function-specific measure of complexity and use it to prove that, for each convex function $f_{\star}$, our algorithm nearly attains the information-theoretically optimal, function-specific error rate. We also corroborate our theoretical contributions with numerical experiments, finding that our method substantially outperforms passive, uniform sampling for favorable synthetic and data-derived functions in low-noise settings with large sampling budgets. Our results also suggest an idealized "oracle strategy", which we use to gauge the potential advance of any adaptive-sampling strategy over passive sampling, for any given convex function.
△ Less
Submitted 26 August, 2018; v1 submitted 14 August, 2018;
originally announced August 2018.
-
A framework for Multi-A(rmed)/B(andit) testing with online FDR control
Authors:
Fanny Yang,
Aaditya Ramdas,
Kevin Jamieson,
Martin J. Wainwright
Abstract:
We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework propose…
▽ More
We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework proposes to replace a sequence of A/B tests by a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. When interleaving the MAB tests with an an online false discovery rate (FDR) algorithm, we can obtain the best of both worlds: low sample complexity and any time online FDR control. Our main contributions are: (i) to propose reasonable definitions of a null hypothesis for MAB instances; (ii) to demonstrate how one can derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) to show that using rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms results in both sample-optimality, high power and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.
△ Less
Submitted 18 November, 2017; v1 submitted 16 June, 2017;
originally announced June 2017.
-
Open Loop Hyperparameter Optimization and Determinantal Point Processes
Authors:
Jesse Dodge,
Kevin Jamieson,
Noah A. Smith
Abstract:
Driven by the need for parallelizable hyperparameter optimization methods, this paper studies \emph{open loop} search methods: sequences that are predetermined and can be generated before a single configuration is evaluated. Examples include grid search, uniform random search, low discrepancy sequences, and other sampling distributions. In particular, we propose the use of $k$-determinantal point…
▽ More
Driven by the need for parallelizable hyperparameter optimization methods, this paper studies \emph{open loop} search methods: sequences that are predetermined and can be generated before a single configuration is evaluated. Examples include grid search, uniform random search, low discrepancy sequences, and other sampling distributions. In particular, we propose the use of $k$-determinantal point processes in hyperparameter optimization via random search. Compared to conventional uniform random search where hyperparameter settings are sampled independently, a $k$-DPP promotes diversity. We describe an approach that transforms hyperparameter search spaces for efficient use with a $k$-DPP. In addition, we introduce a novel Metropolis-Hastings algorithm which can sample from $k$-DPPs defined over any space from which uniform samples can be drawn, including spaces with a mixture of discrete and continuous dimensions or tree structure. Our experiments show significant benefits in realistic scenarios with a limited budget for training supervised learners, whether in serial or parallel.
△ Less
Submitted 8 May, 2019; v1 submitted 5 June, 2017;
originally announced June 2017.
-
The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime
Authors:
Max Simchowitz,
Kevin Jamieson,
Benjamin Recht
Abstract:
We propose a novel technique for analyzing adaptive sampling called the {\em Simulator}. Our approach differs from the existing methods by considering not how much information could be gathered by any fixed sampling strategy, but how difficult it is to distinguish a good sampling strategy from a bad one given the limited amount of data collected up to any given time. This change of perspective all…
▽ More
We propose a novel technique for analyzing adaptive sampling called the {\em Simulator}. Our approach differs from the existing methods by considering not how much information could be gathered by any fixed sampling strategy, but how difficult it is to distinguish a good sampling strategy from a bad one given the limited amount of data collected up to any given time. This change of perspective allows us to match the strength of both Fano and change-of-measure techniques, without succumbing to the limitations of either method. For concreteness, we apply our techniques to a structured multi-arm bandit problem in the fixed-confidence pure exploration setting, where we show that the constraints on the means imply a substantial gap between the moderate-confidence sample complexity, and the asymptotic sample complexity as $δ\to 0$ found in the literature. We also prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors. Moreover, our lower bounds zero-in on the number of times each \emph{individual} arm needs to be pulled, uncovering new phenomena which are drowned out in the aggregate sample complexity. Our new analysis inspires a simple and near-optimal algorithm for the best-arm and top-k identification, the first {\em practical} algorithm of its kind for the latter problem which removes extraneous log factors, and outperforms the state-of-the-art in experiments.
△ Less
Submitted 23 April, 2023; v1 submitted 16 February, 2017;
originally announced February 2017.
-
Finite Sample Prediction and Recovery Bounds for Ordinal Embedding
Authors:
Lalit Jain,
Kevin Jamieson,
Robert Nowak
Abstract:
The goal of ordinal embedding is to represent items as points in a low-dimensional Euclidean space given a set of constraints in the form of distance comparisons like "item $i$ is closer to item $j$ than item $k$". Ordinal constraints like this often come from human judgments. To account for errors and variation in judgments, we consider the noisy situation in which the given constraints are indep…
▽ More
The goal of ordinal embedding is to represent items as points in a low-dimensional Euclidean space given a set of constraints in the form of distance comparisons like "item $i$ is closer to item $j$ than item $k$". Ordinal constraints like this often come from human judgments. To account for errors and variation in judgments, we consider the noisy situation in which the given constraints are independently corrupted by reversing the correct constraint with some probability. This paper makes several new contributions to this problem. First, we derive prediction error bounds for ordinal embedding with noise by exploiting the fact that the rank of a distance matrix of points in $\mathbb{R}^d$ is at most $d+2$. These bounds characterize how well a learned embedding predicts new comparative judgments. Second, we investigate the special case of a known noise model and study the Maximum Likelihood estimator. Third, knowledge of the noise model enables us to relate prediction errors to embedding accuracy. This relationship is highly non-trivial since we show that the linear map corresponding to distance comparisons is non-invertible, but there exists a nonlinear map that is invertible. Fourth, two new algorithms for ordinal embedding are proposed and evaluated in experiments.
△ Less
Submitted 22 June, 2016;
originally announced June 2016.
-
Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
Authors:
Lisha Li,
Kevin Jamieson,
Giulia DeSalvo,
Afshin Rostamizadeh,
Ameet Talwalkar
Abstract:
Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem wh…
▽ More
Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce a novel algorithm, Hyperband, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems.
△ Less
Submitted 18 June, 2018; v1 submitted 21 March, 2016;
originally announced March 2016.
-
Best-of-K Bandits
Authors:
Max Simchowitz,
Kevin Jamieson,
Benjamin Recht
Abstract:
This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lo…
▽ More
This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds based on a particular construction which force a learner to consider all N-choose-K subsets, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain, favorable distributions because the influence of high-order order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these result with independent-arm lower bounds.
△ Less
Submitted 18 March, 2016; v1 submitted 8 March, 2016;
originally announced March 2016.
-
Non-stochastic Best Arm Identification and Hyperparameter Optimization
Authors:
Kevin Jamieson,
Ameet Talwalkar
Abstract:
Motivated by the task of hyperparameter optimization, we introduce the non-stochastic best-arm identification problem. Within the multi-armed bandit literature, the cumulative regret objective enjoys algorithms and analyses for both the non-stochastic and stochastic settings while to the best of our knowledge, the best-arm identification framework has only been considered in the stochastic setting…
▽ More
Motivated by the task of hyperparameter optimization, we introduce the non-stochastic best-arm identification problem. Within the multi-armed bandit literature, the cumulative regret objective enjoys algorithms and analyses for both the non-stochastic and stochastic settings while to the best of our knowledge, the best-arm identification framework has only been considered in the stochastic setting. We introduce the non-stochastic setting under this framework, identify a known algorithm that is well-suited for this setting, and analyze its behavior. Next, by leveraging the iterative nature of standard machine learning algorithms, we cast hyperparameter optimization as an instance of non-stochastic best-arm identification, and empirically evaluate our proposed algorithm on this task. Our empirical results show that, by allocating more resources to promising hyperparameter settings, we typically achieve comparable test accuracies an order of magnitude faster than baseline methods.
△ Less
Submitted 27 February, 2015;
originally announced February 2015.
-
Sparse Dueling Bandits
Authors:
Kevin Jamieson,
Sumeet Katariya,
Atul Deshpande,
Robert Nowak
Abstract:
The dueling bandit problem is a variation of the classical multi-armed bandit in which the allowable actions are noisy comparisons between pairs of arms. This paper focuses on a new approach for finding the "best" arm according to the Borda criterion using noisy comparisons. We prove that in the absence of structural assumptions, the sample complexity of this problem is proportional to the sum of…
▽ More
The dueling bandit problem is a variation of the classical multi-armed bandit in which the allowable actions are noisy comparisons between pairs of arms. This paper focuses on a new approach for finding the "best" arm according to the Borda criterion using noisy comparisons. We prove that in the absence of structural assumptions, the sample complexity of this problem is proportional to the sum of the inverse squared gaps between the Borda scores of each suboptimal arm and the best arm. We explore this dependence further and consider structural constraints on the pairwise comparison matrix (a particular form of sparsity natural to this problem) that can significantly reduce the sample complexity. This motivates a new algorithm called Successive Elimination with Comparison Sparsity (SECS) that exploits sparsity to find the Borda winner using fewer samples than standard algorithms. We also evaluate the new algorithm experimentally with synthetic and real data. The results show that the sparsity model and the new algorithm can provide significant improvements over standard approaches.
△ Less
Submitted 31 January, 2015;
originally announced February 2015.
-
lil' UCB : An Optimal Exploration Algorithm for Multi-Armed Bandits
Authors:
Kevin Jamieson,
Matthew Malloy,
Robert Nowak,
Sébastien Bubeck
Abstract:
The paper proposes a novel upper confidence bound (UCB) procedure for identifying the arm with the largest mean in a multi-armed bandit game in the fixed confidence setting using a small number of total samples. The procedure cannot be improved in the sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated log…
▽ More
The paper proposes a novel upper confidence bound (UCB) procedure for identifying the arm with the largest mean in a multi-armed bandit game in the fixed confidence setting using a small number of total samples. The procedure cannot be improved in the sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated logarithm (LIL). Inspired by the LIL, we construct our confidence bounds to explicitly account for the infinite time horizon of the algorithm. In addition, by using a novel stopping time for the algorithm we avoid a union bound over the arms that has been observed in other UCB-type algorithms. We prove that the algorithm is optimal up to constants and also show through simulations that it provides superior performance with respect to the state-of-the-art.
△ Less
Submitted 27 December, 2013;
originally announced December 2013.
-
On Finding the Largest Mean Among Many
Authors:
Kevin Jamieson,
Matthew Malloy,
Robert Nowak,
Sebastien Bubeck
Abstract:
Sampling from distributions to find the one with the largest mean arises in a broad range of applications, and it can be mathematically modeled as a multi-armed bandit problem in which each distribution is associated with an arm. This paper studies the sample complexity of identifying the best arm (largest mean) in a multi-armed bandit problem. Motivated by large-scale applications, we are especia…
▽ More
Sampling from distributions to find the one with the largest mean arises in a broad range of applications, and it can be mathematically modeled as a multi-armed bandit problem in which each distribution is associated with an arm. This paper studies the sample complexity of identifying the best arm (largest mean) in a multi-armed bandit problem. Motivated by large-scale applications, we are especially interested in identifying situations where the total number of samples that are necessary and sufficient to find the best arm scale linearly with the number of arms. We present a single-parameter multi-armed bandit model that spans the range from linear to superlinear sample complexity. We also give a new algorithm for best arm identification, called PRISM, with linear sample complexity for a wide range of mean distributions. The algorithm, like most exploration procedures for multi-armed bandits, is adaptive in the sense that the next arms to sample are selected based on previous samples. We compare the sample complexity of adaptive procedures with simpler non-adaptive procedures using new lower bounds. For many problem instances, the increased sample complexity required by non-adaptive procedures is a polynomial factor of the number of arms.
△ Less
Submitted 17 June, 2013;
originally announced June 2013.
-
Query Complexity of Derivative-Free Optimization
Authors:
Kevin G. Jamieson,
Robert D. Nowak,
Benjamin Recht
Abstract:
This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm…
▽ More
This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same.
△ Less
Submitted 11 September, 2012;
originally announced September 2012.
-
Active Ranking using Pairwise Comparisons
Authors:
Kevin G. Jamieson,
Robert D. Nowak
Abstract:
This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of $n$ objects can be identified by standard sorting methods using $n log_2 n$ pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically,…
▽ More
This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of $n$ objects can be identified by standard sorting methods using $n log_2 n$ pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically, we assume that the objects can be embedded into a $d$-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in $R^d$. We show that under this assumption the number of possible rankings grows like $n^{2d}$ and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than $d log n$ adaptively selected pairwise comparisons, on average. If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis.
△ Less
Submitted 9 December, 2011; v1 submitted 16 September, 2011;
originally announced September 2011.