-
On Union-Closedness of Language Generation
Authors:
Steve Hanneke,
Amin Karbasi,
Anay Mehrotra,
Grigoris Velegkas
Abstract:
We investigate language generation in the limit - a model by Kleinberg and Mullainathan [NeurIPS 2024] and extended by Li, Raman, and Tewari [COLT 2025]. While Kleinberg and Mullainathan proved generation is possible for all countable collections, Li et al. defined a hierarchy of generation notions (uniform, non-uniform, and generatable) and explored their feasibility for uncountable collections.…
▽ More
We investigate language generation in the limit - a model by Kleinberg and Mullainathan [NeurIPS 2024] and extended by Li, Raman, and Tewari [COLT 2025]. While Kleinberg and Mullainathan proved generation is possible for all countable collections, Li et al. defined a hierarchy of generation notions (uniform, non-uniform, and generatable) and explored their feasibility for uncountable collections.
Our first set of results resolve two open questions of Li et al. by proving finite unions of generatable or non-uniformly generatable classes need not be generatable. These follow from a stronger result: there is a non-uniformly generatable class and a uniformly generatable class whose union is non-generatable. This adds to the aspects along which language generation in the limit is different from traditional tasks in statistical learning theory like classification, which are closed under finite unions. In particular, it implies that given two generators for different collections, one cannot combine them to obtain a single "more powerful" generator, prohibiting this notion of boosting.
Our construction also addresses a third open question of Li et al. on whether there are uncountable classes that are non-uniformly generatable and do not satisfy the eventually unbounded closure (EUC) condition introduced by Li, Raman, and Tewari. Our approach utilizes carefully constructed classes along with a novel diagonalization argument that could be of independent interest in the growing area of language generation.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
Universal Rates of ERM for Agnostic Learning
Authors:
Steve Hanneke,
Mingyue Xu
Abstract:
The universal learning framework has been developed to obtain guarantees on the learning rates that hold for any fixed distribution, which can be much faster than the ones uniformly hold over all the distributions. Given that the Empirical Risk Minimization (ERM) principle being fundamental in the PAC theory and ubiquitous in practical machine learning, the recent work of arXiv:2412.02810 studied…
▽ More
The universal learning framework has been developed to obtain guarantees on the learning rates that hold for any fixed distribution, which can be much faster than the ones uniformly hold over all the distributions. Given that the Empirical Risk Minimization (ERM) principle being fundamental in the PAC theory and ubiquitous in practical machine learning, the recent work of arXiv:2412.02810 studied the universal rates of ERM for binary classification under the realizable setting. However, the assumption of realizability is too restrictive to hold in practice. Indeed, the majority of the literature on universal learning has focused on the realizable case, leaving the non-realizable case barely explored.
In this paper, we consider the problem of universal learning by ERM for binary classification under the agnostic setting, where the ''learning curve" reflects the decay of the excess risk as the sample size increases. We explore the possibilities of agnostic universal rates and reveal a compact trichotomy: there are three possible agnostic universal rates of ERM, being either $e^{-n}$, $o(n^{-1/2})$, or arbitrarily slow. We provide a complete characterization of which concept classes fall into each of these categories. Moreover, we also establish complete characterizations for the target-dependent universal rates as well as the Bayes-dependent universal rates.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
Private List Learnability vs. Online List Learnability
Authors:
Steve Hanneke,
Shay Moran,
Hilla Schefler,
Iska Tsubari
Abstract:
This work explores the connection between differential privacy (DP) and online learning in the context of PAC list learning. In this setting, a $k$-list learner outputs a list of $k$ potential predictions for an instance $x$ and incurs a loss if the true label of $x$ is not included in the list. A basic result in the multiclass PAC framework with a finite number of labels states that private learn…
▽ More
This work explores the connection between differential privacy (DP) and online learning in the context of PAC list learning. In this setting, a $k$-list learner outputs a list of $k$ potential predictions for an instance $x$ and incurs a loss if the true label of $x$ is not included in the list. A basic result in the multiclass PAC framework with a finite number of labels states that private learnability is equivalent to online learnability [Alon, Livni, Malliaris, and Moran (2019); Bun, Livni, and Moran (2020); Jung, Kim, and Tewari (2020)]. Perhaps surprisingly, we show that this equivalence does not hold in the context of list learning. Specifically, we prove that, unlike in the multiclass setting, a finite $k$-Littlestone dimensio--a variant of the classical Littlestone dimension that characterizes online $k$-list learnability--is not a sufficient condition for DP $k$-list learnability. However, similar to the multiclass case, we prove that it remains a necessary condition.
To demonstrate where the equivalence breaks down, we provide an example showing that the class of monotone functions with $k+1$ labels over $\mathbb{N}$ is online $k$-list learnable, but not DP $k$-list learnable. This leads us to introduce a new combinatorial dimension, the \emph{$k$-monotone dimension}, which serves as a generalization of the threshold dimension. Unlike the multiclass setting, where the Littlestone and threshold dimensions are finite together, for $k>1$, the $k$-Littlestone and $k$-monotone dimensions do not exhibit this relationship. We prove that a finite $k$-monotone dimension is another necessary condition for DP $k$-list learnability, alongside finite $k$-Littlestone dimension. Whether the finiteness of both dimensions implies private $k$-list learnability remains an open question.
△ Less
Submitted 15 June, 2025;
originally announced June 2025.
-
Tradeoffs between Mistakes and ERM Oracle Calls in Online and Transductive Online Learning
Authors:
Idan Attias,
Steve Hanneke,
Arvind Ramaswami
Abstract:
We study online and transductive online learning when the learner interacts with the concept class only via Empirical Risk Minimization (ERM) or weak consistency oracles on arbitrary instance subsets. This contrasts with standard online models, where the learner knows the entire class. The ERM oracle returns a hypothesis minimizing loss on a given subset, while the weak consistency oracle returns…
▽ More
We study online and transductive online learning when the learner interacts with the concept class only via Empirical Risk Minimization (ERM) or weak consistency oracles on arbitrary instance subsets. This contrasts with standard online models, where the learner knows the entire class. The ERM oracle returns a hypothesis minimizing loss on a given subset, while the weak consistency oracle returns a binary signal indicating whether the subset is realizable by some concept. The learner is evaluated by the number of mistakes and oracle calls. In the standard online setting with ERM access, we prove tight lower bounds in both realizable and agnostic cases: $Ω(2^{d_{VC}})$ mistakes and $Ω(\sqrt{T 2^{d_{LD}}})$ regret, where $T$ is the number of timesteps and $d_{LD}$ is the Littlestone dimension. We further show that existing online learning results with ERM access carry over to the weak consistency setting, incurring an additional $O(T)$ in oracle calls. We then consider the transductive online model, where the instance sequence is known but labels are revealed sequentially. For general Littlestone classes, we show that optimal realizable and agnostic mistake bounds can be achieved using $O(T^{d_{VC}+1})$ weak consistency oracle calls. On the negative side, we show that limiting the learner to $Ω(T)$ weak consistency queries is necessary for transductive online learnability, and that restricting the learner to $Ω(T)$ ERM queries is necessary to avoid exponential dependence on the Littlestone dimension. Finally, for certain concept classes, we reduce oracle calls via randomized algorithms while maintaining similar mistake bounds. In particular, for Thresholds on an unknown ordering, $O(\log T)$ ERM queries suffice; for $k$-Intervals, $O(T^3 2^{2k})$ weak consistency queries suffice.
△ Less
Submitted 30 May, 2025;
originally announced June 2025.
-
Data Selection for ERMs
Authors:
Steve Hanneke,
Shay Moran,
Alexander Shlimovich,
Amir Yehudayoff
Abstract:
Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning…
▽ More
Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population.
We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.
△ Less
Submitted 25 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
Proofs as Explanations: Short Certificates for Reliable Predictions
Authors:
Avrim Blum,
Steve Hanneke,
Chirag Pabbaraju,
Donya Saless
Abstract:
We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S'$ of the training data (if it exists) such that all classifiers $h' \in H$ that make at most $b$ mistakes on $S'$ predict $h'(x)=y$. Such a set $S'$ serves as a proof that $x$ indeed has label $y$ under the assumption that (1) the target function $h^\star$ belongs to $H$, and (2) the se…
▽ More
We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S'$ of the training data (if it exists) such that all classifiers $h' \in H$ that make at most $b$ mistakes on $S'$ predict $h'(x)=y$. Such a set $S'$ serves as a proof that $x$ indeed has label $y$ under the assumption that (1) the target function $h^\star$ belongs to $H$, and (2) the set $S$ contains at most $b$ corrupted points. For example, if $b=0$ and $H$ is the family of linear classifiers in $\mathbb{R}^d$, and if $x$ lies inside the convex hull of the positive data points in $S$ (and hence every consistent linear classifier labels $x$ as positive), then Carathéodory's theorem states that $x$ lies inside the convex hull of $d+1$ of those points. So, a set $S'$ of size $d+1$ could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of realizability.
In this work, we consider this problem more generally, for general hypothesis classes $H$ and general values $b\geq 0$. We define the notion of the robust hollow star number of $H$ (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as distribution-dependent bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the certificate coefficient $\varepsilon_x$ of an example $x$ with respect to a data distribution $D$ and target function $h^\star$, and prove matching upper and lower bounds on sample size as a function of $\varepsilon_x$, $b$, and the VC dimension $d$ of $H$.
△ Less
Submitted 24 June, 2025; v1 submitted 11 April, 2025;
originally announced April 2025.
-
A Theory of Optimistically Universal Online Learnability for General Concept Classes
Authors:
Steve Hanneke,
Hongao Wang
Abstract:
We provide a full characterization of the concept classes that are optimistically universally online learnable with $\{0, 1\}$ labels. The notion of optimistically universal online learning was defined in [Hanneke, 2021] in order to understand learnability under minimal assumptions. In this paper, following the philosophy behind that work, we investigate two questions, namely, for every concept cl…
▽ More
We provide a full characterization of the concept classes that are optimistically universally online learnable with $\{0, 1\}$ labels. The notion of optimistically universal online learning was defined in [Hanneke, 2021] in order to understand learnability under minimal assumptions. In this paper, following the philosophy behind that work, we investigate two questions, namely, for every concept class: (1) What are the minimal assumptions on the data process admitting online learnability? (2) Is there a learning algorithm which succeeds under every data process satisfying the minimal assumptions? Such an algorithm is said to be optimistically universal for the given concept class. We resolve both of these questions for all concept classes, and moreover, as part of our solution, we design general learning algorithms for each case. Finally, we extend these algorithms and results to the agnostic case, showing an equivalence between the minimal assumptions on the data process for learnability in the agnostic and realizable cases, for every concept class, as well as the equivalence of optimistically universal learnability.
△ Less
Submitted 14 January, 2025;
originally announced January 2025.
-
Learning from Snapshots of Discrete and Continuous Data Streams
Authors:
Pramith Devulapalli,
Steve Hanneke
Abstract:
Imagine a smart camera trap selectively clicking pictures to understand animal movement patterns within a particular habitat. These "snapshots", or pieces of data captured from a data stream at adaptively chosen times, provide a glimpse of different animal movements unfolding through time. Learning a continuous-time process through snapshots, such as smart camera traps, is a central theme governin…
▽ More
Imagine a smart camera trap selectively clicking pictures to understand animal movement patterns within a particular habitat. These "snapshots", or pieces of data captured from a data stream at adaptively chosen times, provide a glimpse of different animal movements unfolding through time. Learning a continuous-time process through snapshots, such as smart camera traps, is a central theme governing a wide array of online learning situations. In this paper, we adopt a learning-theoretic perspective in understanding the fundamental nature of learning different classes of functions from both discrete data streams and continuous data streams. In our first framework, the \textit{update-and-deploy} setting, a learning algorithm discretely queries from a process to update a predictor designed to make predictions given as input the data stream. We construct a uniform sampling algorithm that can learn with bounded error any concept class with finite Littlestone dimension. Our second framework, known as the \emph{blind-prediction} setting, consists of a learning algorithm generating predictions independently of observing the process, only engaging with the process when it chooses to make queries. Interestingly, we show a stark contrast in learnability where non-trivial concept classes are unlearnable. However, we show that adaptive learning algorithms are necessary to learn sets of time-dependent and data-dependent functions, called pattern classes, in either framework. Finally, we develop a theory of pattern classes under discrete data streams for the blind-prediction setting.
△ Less
Submitted 8 December, 2024;
originally announced December 2024.
-
Universal Rates of Empirical Risk Minimization
Authors:
Steve Hanneke,
Mingyue Xu
Abstract:
The well-known empirical risk minimization (ERM) principle is the basis of many widely used machine learning algorithms, and plays an essential role in the classical PAC theory. A common description of a learning algorithm's performance is its so-called "learning curve", that is, the decay of the expected error as a function of the input sample size. As the PAC model fails to explain the behavior…
▽ More
The well-known empirical risk minimization (ERM) principle is the basis of many widely used machine learning algorithms, and plays an essential role in the classical PAC theory. A common description of a learning algorithm's performance is its so-called "learning curve", that is, the decay of the expected error as a function of the input sample size. As the PAC model fails to explain the behavior of learning curves, recent research has explored an alternative universal learning model and has ultimately revealed a distinction between optimal universal and uniform learning rates (Bousquet et al., 2021). However, a basic understanding of such differences with a particular focus on the ERM principle has yet to be developed.
In this paper, we consider the problem of universal learning by ERM in the realizable case and study the possible universal rates. Our main result is a fundamental tetrachotomy: there are only four possible universal learning rates by ERM, namely, the learning curves of any concept class learnable by ERM decay either at $e^{-n}$, $1/n$, $\log(n)/n$, or arbitrarily slow rates. Moreover, we provide a complete characterization of which concept classes fall into each of these categories, via new complexity structures. We also develop new combinatorial dimensions which supply sharp asymptotically-valid constant factors for these rates, whenever possible.
△ Less
Submitted 30 January, 2025; v1 submitted 3 December, 2024;
originally announced December 2024.
-
On the ERM Principle in Meta-Learning
Authors:
Yannay Alon,
Steve Hanneke,
Shay Moran,
Uri Shalit
Abstract:
Classic supervised learning involves algorithms trained on $n$ labeled examples to produce a hypothesis $h \in \mathcal{H}$ aimed at performing well on unseen examples. Meta-learning extends this by training across $n$ tasks, with $m$ examples per task, producing a hypothesis class $\mathcal{H}$ within some meta-class $\mathbb{H}$. This setting applies to many modern problems such as in-context le…
▽ More
Classic supervised learning involves algorithms trained on $n$ labeled examples to produce a hypothesis $h \in \mathcal{H}$ aimed at performing well on unseen examples. Meta-learning extends this by training across $n$ tasks, with $m$ examples per task, producing a hypothesis class $\mathcal{H}$ within some meta-class $\mathbb{H}$. This setting applies to many modern problems such as in-context learning, hypernetworks, and learning-to-learn. A common method for evaluating the performance of supervised learning algorithms is through their learning curve, which depicts the expected error as a function of the number of training examples. In meta-learning, the learning curve becomes a two-dimensional learning surface, which evaluates the expected error on unseen domains for varying values of $n$ (number of tasks) and $m$ (number of training examples).
Our findings characterize the distribution-free learning surfaces of meta-Empirical Risk Minimizers when either $m$ or $n$ tend to infinity: we show that the number of tasks must increase inversely with the desired error. In contrast, we show that the number of examples exhibits very different behavior: it satisfies a dichotomy where every meta-class conforms to one of the following conditions: (i) either $m$ must grow inversely with the error, or (ii) a \emph{finite} number of examples per task suffices for the error to vanish as $n$ goes to infinity. This finding illustrates and characterizes cases in which a small number of examples per task is sufficient for successful learning. We further refine this for positive values of $\varepsilon$ and identify for each $\varepsilon$ how many examples per task are needed to achieve an error of $\varepsilon$ in the limit as the number of tasks $n$ goes to infinity. We achieve this by developing a necessary and sufficient condition for meta-learnability using a bounded number of examples per domain.
△ Less
Submitted 26 November, 2024;
originally announced November 2024.
-
Multiclass Transductive Online Learning
Authors:
Steve Hanneke,
Vinod Raman,
Amirreza Shaeiri,
Unique Subedi
Abstract:
We consider the problem of multiclass transductive online learning when the number of labels can be unbounded. Previous works by Ben-David et al. [1997] and Hanneke et al. [2023b] only consider the case of binary and finite label spaces, respectively. The latter work determined that their techniques fail to extend to the case of unbounded label spaces, and they pose the question of characterizing…
▽ More
We consider the problem of multiclass transductive online learning when the number of labels can be unbounded. Previous works by Ben-David et al. [1997] and Hanneke et al. [2023b] only consider the case of binary and finite label spaces, respectively. The latter work determined that their techniques fail to extend to the case of unbounded label spaces, and they pose the question of characterizing the optimal mistake bound for unbounded label spaces. We answer this question by showing that a new dimension, termed the Level-constrained Littlestone dimension, characterizes online learnability in this setting. Along the way, we show that the trichotomy of possible minimax rates of the expected number of mistakes established by Hanneke et al. [2023b] for finite label spaces in the realizable setting continues to hold even when the label space is unbounded. In particular, if the learner plays for $T \in \mathbb{N}$ rounds, its minimax expected number of mistakes can only grow like $Θ(T)$, $Θ(\log T)$, or $Θ(1)$. To prove this result, we give another combinatorial dimension, termed the Level-constrained Branching dimension, and show that its finiteness characterizes constant minimax expected mistake-bounds. The trichotomy is then determined by a combination of the Level-constrained Littlestone and Branching dimensions. Quantitatively, our upper bounds improve upon existing multiclass upper bounds in Hanneke et al. [2023b] by removing the dependence on the label set size. In doing so, we explicitly construct learning algorithms that can handle extremely large or unbounded label spaces. A key and novel component of our algorithm is a new notion of shattering that exploits the sequential nature of transductive online learning. Finally, we complete our results by proving expected regret bounds in the agnostic setting, extending the result of Hanneke et al. [2023b].
△ Less
Submitted 3 November, 2024;
originally announced November 2024.
-
Sample Compression Scheme Reductions
Authors:
Idan Attias,
Steve Hanneke,
Arvind Ramaswami
Abstract:
We present novel reductions from sample compression schemes in multiclass classification, regression, and adversarially robust learning settings to binary sample compression schemes. Assuming we have a compression scheme for binary classes of size $f(d_\mathrm{VC})$, where $d_\mathrm{VC}$ is the VC dimension, then we have the following results: (1) If the binary compression scheme is a majority-vo…
▽ More
We present novel reductions from sample compression schemes in multiclass classification, regression, and adversarially robust learning settings to binary sample compression schemes. Assuming we have a compression scheme for binary classes of size $f(d_\mathrm{VC})$, where $d_\mathrm{VC}$ is the VC dimension, then we have the following results: (1) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists a multiclass compression scheme of size $O(f(d_\mathrm{G}))$, where $d_\mathrm{G}$ is the graph dimension. Moreover, for general binary compression schemes, we obtain a compression of size $O(f(d_\mathrm{G})\log|Y|)$, where $Y$ is the label space. (2) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists an $ε$-approximate compression scheme for regression over $[0,1]$-valued functions of size $O(f(d_\mathrm{P}))$, where $d_\mathrm{P}$ is the pseudo-dimension. For general binary compression schemes, we obtain a compression of size $O(f(d_\mathrm{P})\log(1/ε))$. These results would have significant implications if the sample compression conjecture, which posits that any binary concept class with a finite VC dimension admits a binary compression scheme of size $O(d_\mathrm{VC})$, is resolved (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995; Warmuth, 2003). Our results would then extend the proof of the conjecture immediately to other settings. We establish similar results for adversarially robust learning and also provide an example of a concept class that is robustly learnable but has no bounded-size compression scheme, demonstrating that learnability is not equivalent to having a compression scheme independent of the sample size, unlike in binary classification, where compression of size $2^{O(d_\mathrm{VC})}$ is attainable (Moran and Yehudayoff, 2016).
△ Less
Submitted 7 April, 2025; v1 submitted 16 October, 2024;
originally announced October 2024.
-
A Complete Characterization of Learnability for Stochastic Noisy Bandits
Authors:
Steve Hanneke,
Kun Wang
Abstract:
We study the stochastic noisy bandit problem with an unknown reward function $f^*$ in a known function class $\mathcal{F}$. Formally, a model $M$ maps arms $π$ to a probability distribution $M(π)$ of reward. A model class $\mathcal{M}$ is a collection of models. For each model $M$, define its mean reward function $f^M(π)=\mathbb{E}_{r \sim M(π)}[r]$. In the bandit learning problem, we proceed in r…
▽ More
We study the stochastic noisy bandit problem with an unknown reward function $f^*$ in a known function class $\mathcal{F}$. Formally, a model $M$ maps arms $π$ to a probability distribution $M(π)$ of reward. A model class $\mathcal{M}$ is a collection of models. For each model $M$, define its mean reward function $f^M(π)=\mathbb{E}_{r \sim M(π)}[r]$. In the bandit learning problem, we proceed in rounds, pulling one arm $π$ each round and observing a reward sampled from $M(π)$. With knowledge of $\mathcal{M}$, supposing that the true model $M\in \mathcal{M}$, the objective is to identify an arm $\hatπ$ of near-maximal mean reward $f^M(\hatπ)$ with high probability in a bounded number of rounds. If this is possible, then the model class is said to be learnable.
Importantly, a result of \cite{hanneke2023bandit} shows there exist model classes for which learnability is undecidable. However, the model class they consider features deterministic rewards, and they raise the question of whether learnability is decidable for classes containing sufficiently noisy models. For the first time, we answer this question in the positive by giving a complete characterization of learnability for model classes with arbitrary noise. In addition to that, we also describe the full spectrum of possible optimal query complexities. Further, we prove adaptivity is sometimes necessary to achieve the optimal query complexity. Last, we revisit an important complexity measure for interactive decision making, the Decision-Estimation-Coefficient \citep{foster2021statistical,foster2023tight}, and propose a new variant of the DEC which also characterizes learnability in this setting.
△ Less
Submitted 16 January, 2025; v1 submitted 12 October, 2024;
originally announced October 2024.
-
Adaptive Sample Aggregation In Transfer Learning
Authors:
Steve Hanneke,
Samory Kpotufe
Abstract:
Transfer Learning aims to optimally aggregate samples from a target distribution, with related samples from a so-called source distribution to improve target risk. Multiple procedures have been proposed over the last two decades to address this problem, each driven by one of a multitude of possible divergence measures between source and target distributions. A first question asked in this work is…
▽ More
Transfer Learning aims to optimally aggregate samples from a target distribution, with related samples from a so-called source distribution to improve target risk. Multiple procedures have been proposed over the last two decades to address this problem, each driven by one of a multitude of possible divergence measures between source and target distributions. A first question asked in this work is whether there exist unified algorithmic approaches that automatically adapt to many of these divergence measures simultaneously.
We show that this is indeed the case for a large family of divergences proposed across classification and regression tasks, as they all happen to upper-bound the same measure of continuity between source and target risks, which we refer to as a weak modulus of transfer. This more unified view allows us, first, to identify algorithmic approaches that are simultaneously adaptive to these various divergence measures via a reduction to particular confidence sets. Second, it allows for a more refined understanding of the statistical limits of transfer under such divergences, and in particular, reveals regimes with faster rates than might be expected under coarser lenses.
We then turn to situations that are not well captured by the weak modulus and corresponding divergences: these are situations where the aggregate of source and target data can improve target performance significantly beyond what's possible with either source or target data alone. We show that common such situations -- as may arise, e.g., under certain causal models with spurious correlations -- are well described by a so-called strong modulus of transfer which supersedes the weak modulus. We finally show that the strong modulus also admits adaptive procedures, which achieve near optimal rates in terms of the unknown strong modulus, and therefore apply in more general settings.
△ Less
Submitted 27 April, 2025; v1 submitted 28 August, 2024;
originally announced August 2024.
-
Revisiting Agnostic PAC Learning
Authors:
Steve Hanneke,
Kasper Green Larsen,
Nikita Zhivotovskiy
Abstract:
PAC learning, dating back to Valiant'84 and Vapnik and Chervonenkis'64,'74, is a classic model for studying supervised learning. In the agnostic setting, we have access to a hypothesis set $\mathcal{H}$ and a training set of labeled samples $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$ drawn i.i.d. from an unknown distribution $\mathcal{D}$. The goal is to produce a classifier…
▽ More
PAC learning, dating back to Valiant'84 and Vapnik and Chervonenkis'64,'74, is a classic model for studying supervised learning. In the agnostic setting, we have access to a hypothesis set $\mathcal{H}$ and a training set of labeled samples $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$ drawn i.i.d. from an unknown distribution $\mathcal{D}$. The goal is to produce a classifier $h : \mathcal{X} \to \{-1,1\}$ that is competitive with the hypothesis $h^\star_{\mathcal{D}} \in \mathcal{H}$ having the least probability of mispredicting the label $y$ of a new sample $(x,y)\sim \mathcal{D}$.
Empirical Risk Minimization (ERM) is a natural learning algorithm, where one simply outputs the hypothesis from $\mathcal{H}$ making the fewest mistakes on the training data. This simple algorithm is known to have an optimal error in terms of the VC-dimension of $\mathcal{H}$ and the number of samples $n$.
In this work, we revisit agnostic PAC learning and first show that ERM is in fact sub-optimal if we treat the performance of the best hypothesis, denoted $τ:=\Pr_{\mathcal{D}}[h^\star_{\mathcal{D}}(x) \neq y]$, as a parameter. Concretely we show that ERM, and any other proper learning algorithm, is sub-optimal by a $\sqrt{\ln(1/τ)}$ factor. We then complement this lower bound with the first learning algorithm achieving an optimal error for nearly the full range of $τ$. Our algorithm introduces several new ideas that we hope may find further applications in learning theory.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
Ramsey Theorems for Trees and a General 'Private Learning Implies Online Learning' Theorem
Authors:
Simone Fioravanti,
Steve Hanneke,
Shay Moran,
Hilla Schefler,
Iska Tsubari
Abstract:
This work continues to investigate the link between differentially private (DP) and online learning. Alon, Livni, Malliaris, and Moran (2019) showed that for binary concept classes, DP learnability of a given class implies that it has a finite Littlestone dimension (equivalently, that it is online learnable). Their proof relies on a model-theoretic result by Hodges (1997), which demonstrates that…
▽ More
This work continues to investigate the link between differentially private (DP) and online learning. Alon, Livni, Malliaris, and Moran (2019) showed that for binary concept classes, DP learnability of a given class implies that it has a finite Littlestone dimension (equivalently, that it is online learnable). Their proof relies on a model-theoretic result by Hodges (1997), which demonstrates that any binary concept class with a large Littlestone dimension contains a large subclass of thresholds. In a follow-up work, Jung, Kim, and Tewari (2020) extended this proof to multiclass PAC learning with a bounded number of labels. Unfortunately, Hodges's result does not apply in other natural settings such as multiclass PAC learning with an unbounded label space, and PAC learning of partial concept classes.
This naturally raises the question of whether DP learnability continues to imply online learnability in more general scenarios: indeed, Alon, Hanneke, Holzman, and Moran (2021) explicitly leave it as an open question in the context of partial concept classes, and the same question is open in the general multiclass setting. In this work, we give a positive answer to these questions showing that for general classification tasks, DP learnability implies online learnability. Our proof reasons directly about Littlestone trees, without relying on thresholds. We achieve this by establishing several Ramsey-type theorems for trees, which might be of independent interest.
△ Less
Submitted 14 August, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Dual VC Dimension Obstructs Sample Compression by Embeddings
Authors:
Zachary Chase,
Bogdan Chornomaz,
Steve Hanneke,
Shay Moran,
Amir Yehudayoff
Abstract:
This work studies embedding of arbitrary VC classes in well-behaved VC classes, focusing particularly on extremal classes. Our main result expresses an impossibility: such embeddings necessarily require a significant increase in dimension. In particular, we prove that for every $d$ there is a class with VC dimension $d$ that cannot be embedded in any extremal class of VC dimension smaller than exp…
▽ More
This work studies embedding of arbitrary VC classes in well-behaved VC classes, focusing particularly on extremal classes. Our main result expresses an impossibility: such embeddings necessarily require a significant increase in dimension. In particular, we prove that for every $d$ there is a class with VC dimension $d$ that cannot be embedded in any extremal class of VC dimension smaller than exponential in $d$.
In addition to its independent interest, this result has an important implication in learning theory, as it reveals a fundamental limitation of one of the most extensively studied approaches to tackling the long-standing sample compression conjecture. Concretely, the approach proposed by Floyd and Warmuth entails embedding any given VC class into an extremal class of a comparable dimension, and then applying an optimal sample compression scheme for extremal classes. However, our results imply that this strategy would in some cases result in a sample compression scheme at least exponentially larger than what is predicted by the sample compression conjecture.
The above implications follow from a general result we prove: any extremal class with VC dimension $d$ has dual VC dimension at most $2d+1$. This bound is exponentially smaller than the classical bound $2^{d+1}-1$ of Assouad, which applies to general concept classes (and is known to be unimprovable for some classes). We in fact prove a stronger result, establishing that $2d+1$ upper bounds the dual Radon number of extremal classes. This theorem represents an abstraction of the classical Radon theorem for convex sets, extending its applicability to a wider combinatorial framework, without relying on the specifics of Euclidean convexity. The proof utilizes the topological method and is primarily based on variants of the Topological Radon Theorem.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
List Sample Compression and Uniform Convergence
Authors:
Steve Hanneke,
Shay Moran,
Tom Waknine
Abstract:
List learning is a variant of supervised classification where the learner outputs multiple plausible labels for each instance rather than just one. We investigate classical principles related to generalization within the context of list learning. Our primary goal is to determine whether classical principles in the PAC setting retain their applicability in the domain of list PAC learning. We focus…
▽ More
List learning is a variant of supervised classification where the learner outputs multiple plausible labels for each instance rather than just one. We investigate classical principles related to generalization within the context of list learning. Our primary goal is to determine whether classical principles in the PAC setting retain their applicability in the domain of list PAC learning. We focus on uniform convergence (which is the basis of Empirical Risk Minimization) and on sample compression (which is a powerful manifestation of Occam's Razor). In classical PAC learning, both uniform convergence and sample compression satisfy a form of `completeness': whenever a class is learnable, it can also be learned by a learning rule that adheres to these principles. We ask whether the same completeness holds true in the list learning setting.
We show that uniform convergence remains equivalent to learnability in the list PAC learning setting. In contrast, our findings reveal surprising results regarding sample compression: we prove that when the label space is $Y=\{0,1,2\}$, then there are 2-list-learnable classes that cannot be compressed. This refutes the list version of the sample compression conjecture by Littlestone and Warmuth (1986). We prove an even stronger impossibility result, showing that there are $2$-list-learnable classes that cannot be compressed even when the reconstructed function can work with lists of arbitrarily large size. We prove a similar result for (1-list) PAC learnable classes when the label space is unbounded. This generalizes a recent result by arXiv:2308.06424.
△ Less
Submitted 16 March, 2024;
originally announced March 2024.
-
The Dimension of Self-Directed Learning
Authors:
Pramith Devulapalli,
Steve Hanneke
Abstract:
Understanding the self-directed learning complexity has been an important problem that has captured the attention of the online learning theory community since the early 1990s. Within this framework, the learner is allowed to adaptively choose its next data point in making predictions unlike the setting in adversarial online learning.
In this paper, we study the self-directed learning complexity…
▽ More
Understanding the self-directed learning complexity has been an important problem that has captured the attention of the online learning theory community since the early 1990s. Within this framework, the learner is allowed to adaptively choose its next data point in making predictions unlike the setting in adversarial online learning.
In this paper, we study the self-directed learning complexity in both the binary and multi-class settings, and we develop a dimension, namely $SDdim$, that exactly characterizes the self-directed learning mistake-bound for any concept class. The intuition behind $SDdim$ can be understood as a two-player game called the "labelling game". Armed with this two-player game, we calculate $SDdim$ on a whole host of examples with notable results on axis-aligned rectangles, VC dimension $1$ classes, and linear separators. We demonstrate several learnability gaps with a central focus on self-directed learning and offline sequence learning models that include either the best or worst ordering. Finally, we extend our analysis to the self-directed binary agnostic setting where we derive upper and lower bounds.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.
-
Bandit-Feedback Online Multiclass Classification: Variants and Tradeoffs
Authors:
Yuval Filmus,
Steve Hanneke,
Idan Mehalel,
Shay Moran
Abstract:
Consider the domain of multiclass classification within the adversarial online setting. What is the price of relying on bandit feedback as opposed to full information? To what extent can an adaptive adversary amplify the loss compared to an oblivious one? To what extent can a randomized learner reduce the loss compared to a deterministic one? We study these questions in the mistake bound model and…
▽ More
Consider the domain of multiclass classification within the adversarial online setting. What is the price of relying on bandit feedback as opposed to full information? To what extent can an adaptive adversary amplify the loss compared to an oblivious one? To what extent can a randomized learner reduce the loss compared to a deterministic one? We study these questions in the mistake bound model and provide nearly tight answers.
We demonstrate that the optimal mistake bound under bandit feedback is at most $O(k)$ times higher than the optimal mistake bound in the full information case, where $k$ represents the number of labels. This bound is tight and provides an answer to an open question previously posed and studied by Daniely and Helbertal ['13] and by Long ['17, '20], who focused on deterministic learners.
Moreover, we present nearly optimal bounds of $\tildeΘ(k)$ on the gap between randomized and deterministic learners, as well as between adaptive and oblivious adversaries in the bandit feedback setting. This stands in contrast to the full information scenario, where adaptive and oblivious adversaries are equivalent, and the gap in mistake bounds between randomized and deterministic learners is a constant multiplicative factor of $2$.
In addition, our results imply that in some cases the optimal randomized mistake bound is approximately the square-root of its deterministic parallel. Previous results show that this is essentially the smallest it can get.
△ Less
Submitted 12 February, 2024;
originally announced February 2024.
-
A Trichotomy for Transductive Online Learning
Authors:
Steve Hanneke,
Shay Moran,
Jonathan Shafer
Abstract:
We present new upper and lower bounds on the number of learner mistakes in the `transductive' online learning setting of Ben-David, Kushilevitz and Mansour (1997). This setting is similar to standard online learning, except that the adversary fixes a sequence of instances $x_1,\dots,x_n$ to be labeled at the start of the game, and this sequence is known to the learner. Qualitatively, we prove a tr…
▽ More
We present new upper and lower bounds on the number of learner mistakes in the `transductive' online learning setting of Ben-David, Kushilevitz and Mansour (1997). This setting is similar to standard online learning, except that the adversary fixes a sequence of instances $x_1,\dots,x_n$ to be labeled at the start of the game, and this sequence is known to the learner. Qualitatively, we prove a trichotomy, stating that the minimal number of mistakes made by the learner as $n$ grows can take only one of precisely three possible values: $n$, $Θ\left(\log (n)\right)$, or $Θ(1)$. Furthermore, this behavior is determined by a combination of the VC dimension and the Littlestone dimension. Quantitatively, we show a variety of bounds relating the number of mistakes to well-known combinatorial dimensions. In particular, we improve the known lower bound on the constant in the $Θ(1)$ case from $Ω\left(\sqrt{\log(d)}\right)$ to $Ω(\log(d))$ where $d$ is the Littlestone dimension. Finally, we extend our results to cover multiclass classification and the agnostic setting.
△ Less
Submitted 29 November, 2023; v1 submitted 10 November, 2023;
originally announced November 2023.
-
Efficient Agnostic Learning with Average Smoothness
Authors:
Steve Hanneke,
Aryeh Kontorovich,
Guy Kornowski
Abstract:
We study distribution-free nonparametric regression following a notion of average smoothness initiated by Ashlagi et al. (2021), which measures the "effective" smoothness of a function with respect to an arbitrary unknown underlying distribution. While the recent work of Hanneke et al. (2023) established tight uniform convergence bounds for average-smooth functions in the realizable case and provi…
▽ More
We study distribution-free nonparametric regression following a notion of average smoothness initiated by Ashlagi et al. (2021), which measures the "effective" smoothness of a function with respect to an arbitrary unknown underlying distribution. While the recent work of Hanneke et al. (2023) established tight uniform convergence bounds for average-smooth functions in the realizable case and provided a computationally efficient realizable learning algorithm, both of these results currently lack analogs in the general agnostic (i.e. noisy) case.
In this work, we fully close these gaps. First, we provide a distribution-free uniform convergence bound for average-smoothness classes in the agnostic setting. Second, we match the derived sample complexity with a computationally efficient agnostic learning algorithm. Our results, which are stated in terms of the intrinsic geometry of the data and hold over any totally bounded metric space, show that the guarantees recently obtained for realizable learning of average-smooth functions transfer to the agnostic setting. At the heart of our proof, we establish the uniform convergence rate of a function class in terms of its bracketing entropy, which may be of independent interest.
△ Less
Submitted 13 February, 2024; v1 submitted 29 September, 2023;
originally announced September 2023.
-
Optimal Learners for Realizable Regression: PAC Learning and Online Learning
Authors:
Idan Attias,
Steve Hanneke,
Alkis Kalavasis,
Amin Karbasi,
Grigoris Velegkas
Abstract:
In this work, we aim to characterize the statistical complexity of realizable regression both in the PAC learning setting and the online learning setting. Previous work had established the sufficiency of finiteness of the fat shattering dimension for PAC learnability and the necessity of finiteness of the scaled Natarajan dimension, but little progress had been made towards a more complete charact…
▽ More
In this work, we aim to characterize the statistical complexity of realizable regression both in the PAC learning setting and the online learning setting. Previous work had established the sufficiency of finiteness of the fat shattering dimension for PAC learnability and the necessity of finiteness of the scaled Natarajan dimension, but little progress had been made towards a more complete characterization since the work of Simon (SICOMP '97). To this end, we first introduce a minimax instance optimal learner for realizable regression and propose a novel dimension that both qualitatively and quantitatively characterizes which classes of real-valued predictors are learnable. We then identify a combinatorial dimension related to the Graph dimension that characterizes ERM learnability in the realizable setting. Finally, we establish a necessary condition for learnability based on a combinatorial dimension related to the DS dimension, and conjecture that it may also be sufficient in this context. Additionally, in the context of online learning we provide a dimension that characterizes the minimax instance optimal cumulative loss up to a constant factor and design an optimal online learner for realizable regression, thus resolving an open question raised by Daskalakis and Golowich in STOC '22.
△ Less
Submitted 2 October, 2024; v1 submitted 7 July, 2023;
originally announced July 2023.
-
Universal Rates for Multiclass Learning
Authors:
Steve Hanneke,
Shay Moran,
Qian Zhang
Abstract:
We study universal rates for multiclass classification, establishing the optimal rates (up to log factors) for all hypothesis classes. This generalizes previous results on binary classification (Bousquet, Hanneke, Moran, van Handel, and Yehudayoff, 2021), and resolves an open question studied by Kalavasis, Velegkas, and Karbasi (2022) who handled the multiclass setting with a bounded number of cla…
▽ More
We study universal rates for multiclass classification, establishing the optimal rates (up to log factors) for all hypothesis classes. This generalizes previous results on binary classification (Bousquet, Hanneke, Moran, van Handel, and Yehudayoff, 2021), and resolves an open question studied by Kalavasis, Velegkas, and Karbasi (2022) who handled the multiclass setting with a bounded number of class labels. In contrast, our result applies for any countable label space. Even for finite label space, our proofs provide a more precise bounds on the learning curves, as they do not depend on the number of labels. Specifically, we show that any class admits exponential rates if and only if it has no infinite Littlestone tree, and admits (near-)linear rates if and only if it has no infinite Daniely-Shalev-Shwartz-Littleston (DSL) tree, and otherwise requires arbitrarily slow rates. DSL trees are a new structure we define in this work, in which each node of the tree is given by a pseudo-cube of possible classifications of a given set of points. Pseudo-cubes are a structure, rooted in the work of Daniely and Shalev-Shwartz (2014), and recently shown by Brukhim, Carmon, Dinur, Moran, and Yehudayoff (2022) to characterize PAC learnability (i.e., uniform rates) for multiclass classification. We also resolve an open question of Kalavasis, Velegkas, and Karbasi (2022) regarding the equivalence of classes having infinite Graph-Littlestone (GL) trees versus infinite Natarajan-Littlestone (NL) trees, showing that they are indeed equivalent.
△ Less
Submitted 5 July, 2023;
originally announced July 2023.
-
Adversarial Resilience in Sequential Prediction via Abstention
Authors:
Surbhi Goel,
Steve Hanneke,
Shay Moran,
Abhishek Shetty
Abstract:
We study the problem of sequential prediction in the stochastic setting with an adversary that is allowed to inject clean-label adversarial (or out-of-distribution) examples. Algorithms designed to handle purely stochastic data tend to fail in the presence of such adversarial examples, often leading to erroneous predictions. This is undesirable in many high-stakes applications such as medical reco…
▽ More
We study the problem of sequential prediction in the stochastic setting with an adversary that is allowed to inject clean-label adversarial (or out-of-distribution) examples. Algorithms designed to handle purely stochastic data tend to fail in the presence of such adversarial examples, often leading to erroneous predictions. This is undesirable in many high-stakes applications such as medical recommendations, where abstaining from predictions on adversarial examples is preferable to misclassification. On the other hand, assuming fully adversarial data leads to very pessimistic bounds that are often vacuous in practice.
To capture this motivation, we propose a new model of sequential prediction that sits between the purely stochastic and fully adversarial settings by allowing the learner to abstain from making a prediction at no cost on adversarial examples. Assuming access to the marginal distribution on the non-adversarial examples, we design a learner whose error scales with the VC dimension (mirroring the stochastic setting) of the hypothesis class, as opposed to the Littlestone dimension which characterizes the fully adversarial setting. Furthermore, we design a learner for VC dimension~1 classes, which works even in the absence of access to the marginal distribution. Our key technical contribution is a novel measure for quantifying uncertainty for learning VC classes, which may be of independent interest.
△ Less
Submitted 24 January, 2024; v1 submitted 22 June, 2023;
originally announced June 2023.
-
Limits of Model Selection under Transfer Learning
Authors:
Steve Hanneke,
Samory Kpotufe,
Yasaman Mahdaviyeh
Abstract:
Theoretical studies on transfer learning or domain adaptation have so far focused on situations with a known hypothesis class or model; however in practice, some amount of model selection is usually involved, often appearing under the umbrella term of hyperparameter-tuning: for example, one may think of the problem of tuning for the right neural network architecture towards a target task, while le…
▽ More
Theoretical studies on transfer learning or domain adaptation have so far focused on situations with a known hypothesis class or model; however in practice, some amount of model selection is usually involved, often appearing under the umbrella term of hyperparameter-tuning: for example, one may think of the problem of tuning for the right neural network architecture towards a target task, while leveraging data from a related source task.
Now, in addition to the usual tradeoffs on approximation vs estimation errors involved in model selection, this problem brings in a new complexity term, namely, the transfer distance between source and target distributions, which is known to vary with the choice of hypothesis class.
We present a first study of this problem, focusing on classification; in particular, the analysis reveals some remarkable phenomena: adaptive rates, i.e., those achievable with no distributional information, can be arbitrarily slower than oracle rates, i.e., when given knowledge on distances.
△ Less
Submitted 12 October, 2023; v1 submitted 28 April, 2023;
originally announced May 2023.
-
Reliable learning in challenging environments
Authors:
Maria-Florina Balcan,
Steve Hanneke,
Rattana Pukdee,
Dravyansh Sharma
Abstract:
The problem of designing learners that provide guarantees that their predictions are provably correct is of increasing importance in machine learning. However, learning theoretic guarantees have only been considered in very specific settings. In this work, we consider the design and analysis of reliable learners in challenging test-time environments as encountered in modern machine learning proble…
▽ More
The problem of designing learners that provide guarantees that their predictions are provably correct is of increasing importance in machine learning. However, learning theoretic guarantees have only been considered in very specific settings. In this work, we consider the design and analysis of reliable learners in challenging test-time environments as encountered in modern machine learning problems: namely `adversarial' test-time attacks (in several variations) and `natural' distribution shifts. In this work, we provide a reliable learner with provably optimal guarantees in such settings. We discuss computationally feasible implementations of the learner and further show that our algorithm achieves strong positive performance guarantees on several natural examples: for example, linear separators under log-concave distributions or smooth boundary classifiers under smooth probability distributions.
△ Less
Submitted 29 October, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Multiclass Online Learning and Uniform Convergence
Authors:
Steve Hanneke,
Shay Moran,
Vinod Raman,
Unique Subedi,
Ambuj Tewari
Abstract:
We study multiclass classification in the agnostic adversarial online learning setting. As our main result, we prove that any multiclass concept class is agnostically learnable if and only if its Littlestone dimension is finite. This solves an open problem studied by Daniely, Sabato, Ben-David, and Shalev-Shwartz (2011,2015) who handled the case when the number of classes (or labels) is bounded. W…
▽ More
We study multiclass classification in the agnostic adversarial online learning setting. As our main result, we prove that any multiclass concept class is agnostically learnable if and only if its Littlestone dimension is finite. This solves an open problem studied by Daniely, Sabato, Ben-David, and Shalev-Shwartz (2011,2015) who handled the case when the number of classes (or labels) is bounded. We also prove a separation between online learnability and online uniform convergence by exhibiting an easy-to-learn class whose sequential Rademacher complexity is unbounded.
Our learning algorithm uses the multiplicative weights algorithm, with a set of experts defined by executions of the Standard Optimal Algorithm on subsequences of size Littlestone dimension. We argue that the best expert has regret at most Littlestone dimension relative to the best concept in the class. This differs from the well-known covering technique of Ben-David, Pál, and Shalev-Shwartz (2009) for binary classification, where the best expert has regret zero.
△ Less
Submitted 7 July, 2023; v1 submitted 30 March, 2023;
originally announced March 2023.
-
Optimal Prediction Using Expert Advice and Randomized Littlestone Dimension
Authors:
Yuval Filmus,
Steve Hanneke,
Idan Mehalel,
Shay Moran
Abstract:
A classical result in online learning characterizes the optimal mistake bound achievable by deterministic learners using the Littlestone dimension (Littlestone '88). We prove an analogous result for randomized learners: we show that the optimal expected mistake bound in learning a class $\mathcal{H}$ equals its randomized Littlestone dimension, which is the largest $d$ for which there exists a tre…
▽ More
A classical result in online learning characterizes the optimal mistake bound achievable by deterministic learners using the Littlestone dimension (Littlestone '88). We prove an analogous result for randomized learners: we show that the optimal expected mistake bound in learning a class $\mathcal{H}$ equals its randomized Littlestone dimension, which is the largest $d$ for which there exists a tree shattered by $\mathcal{H}$ whose average depth is $2d$. We further study optimal mistake bounds in the agnostic case, as a function of the number of mistakes made by the best function in $\mathcal{H}$, denoted by $k$. We show that the optimal randomized mistake bound for learning a class with Littlestone dimension $d$ is $k + Θ(\sqrt{k d} + d )$. This also implies an optimal deterministic mistake bound of $2k + Θ(d) + O(\sqrt{k d})$, thus resolving an open question which was studied by Auer and Long ['99].
As an application of our theory, we revisit the classical problem of prediction using expert advice: about 30 years ago Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth studied prediction using expert advice, provided that the best among the $n$ experts makes at most $k$ mistakes, and asked what are the optimal mistake bounds. Cesa-Bianchi, Freund, Helmbold, and Warmuth ['93, '96] provided a nearly optimal bound for deterministic learners, and left the randomized case as an open problem. We resolve this question by providing an optimal learning rule in the randomized case, and showing that its expected mistake bound equals half of the deterministic bound of Cesa-Bianchi et al. ['93,'96], up to negligible additive terms. In contrast with previous works by Abernethy, Langford, and Warmuth ['06], and by Brânzei and Peres ['19], our result applies to all pairs $n,k$.
△ Less
Submitted 17 August, 2023; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Adversarial Rewards in Universal Learning for Contextual Bandits
Authors:
Moise Blanchard,
Steve Hanneke,
Patrick Jaillet
Abstract:
We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context, which extends the canonical multi-armed bandit to the case where side-information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction…
▽ More
We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context, which extends the canonical multi-armed bandit to the case where side-information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, when the underlying reward mechanism is time-invariant, Blanchard et. al (2022) characterized learnable context processes for which universal consistency is achievable; and further gave algorithms ensuring universal consistency whenever this is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly adversarially, and depending on the learner's actions. We show that optimistic universal learning for contextual bandits with adversarial rewards is impossible in general, contrary to all previously studied settings in online learning -- including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various adversarial reward models, and an exact characterization for online rewards. In particular, the set of learnable processes for these reward models is still extremely general -- larger than i.i.d., stationary or ergodic -- but in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new adversarial phenomena.
△ Less
Submitted 12 June, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Near-optimal learning with average Hölder smoothness
Authors:
Steve Hanneke,
Aryeh Kontorovich,
Guy Kornowski
Abstract:
We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving…
▽ More
We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness. Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.
△ Less
Submitted 30 October, 2023; v1 submitted 12 February, 2023;
originally announced February 2023.
-
Contextual Bandits and Optimistically Universal Learning
Authors:
Moise Blanchard,
Steve Hanneke,
Patrick Jaillet
Abstract:
We consider the contextual bandit problem on general action and context spaces, where the learner's rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients' records or customers' history, which allows for personalized treatment. We focus on consistency -- vanishing regret co…
▽ More
We consider the contextual bandit problem on general action and context spaces, where the learner's rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients' records or customers' history, which allows for personalized treatment. We focus on consistency -- vanishing regret compared to the optimal policy -- and show that for large classes of non-i.i.d. contexts, consistency can be achieved regardless of the time-invariant reward mechanism, a property known as universal consistency. Precisely, we first give necessary and sufficient conditions on the context-generating process for universal consistency to be possible. Second, we show that there always exists an algorithm that guarantees universal consistency whenever this is achievable, called an optimistically universal learning rule. Interestingly, for finite action spaces, learnable processes for universal learning are exactly the same as in the full-feedback setting of supervised learning, previously studied in the literature. In other words, learning can be performed with partial feedback without any generalization cost. The algorithms balance a trade-off between generalization (similar to structural risk minimization) and personalization (tailoring actions to specific contexts). Lastly, we consider the case of added continuity assumptions on rewards and show that these lead to universal consistency for significantly larger classes of data-generating processes.
△ Less
Submitted 31 December, 2022;
originally announced January 2023.
-
On Optimal Learning Under Targeted Data Poisoning
Authors:
Steve Hanneke,
Amin Karbasi,
Mohammad Mahmoody,
Idan Mehalel,
Shay Moran
Abstract:
Consider the task of learning a hypothesis class $\mathcal{H}$ in the presence of an adversary that can replace up to an $η$ fraction of the examples in the training set with arbitrary adversarial examples. The adversary aims to fail the learner on a particular target test point $x$ which is known to the adversary but not to the learner. In this work we aim to characterize the smallest achievable…
▽ More
Consider the task of learning a hypothesis class $\mathcal{H}$ in the presence of an adversary that can replace up to an $η$ fraction of the examples in the training set with arbitrary adversarial examples. The adversary aims to fail the learner on a particular target test point $x$ which is known to the adversary but not to the learner. In this work we aim to characterize the smallest achievable error $ε=ε(η)$ by the learner in the presence of such an adversary in both realizable and agnostic settings. We fully achieve this in the realizable setting, proving that $ε=Θ(\mathtt{VC}(\mathcal{H})\cdot η)$, where $\mathtt{VC}(\mathcal{H})$ is the VC dimension of $\mathcal{H}$. Remarkably, we show that the upper bound can be attained by a deterministic learner. In the agnostic setting we reveal a more elaborate landscape: we devise a deterministic learner with a multiplicative regret guarantee of $ε\leq C\cdot\mathtt{OPT} + O(\mathtt{VC}(\mathcal{H})\cdot η)$, where $C > 1$ is a universal numerical constant. We complement this by showing that for any deterministic learner there is an attack which worsens its error to at least $2\cdot \mathtt{OPT}$. This implies that a multiplicative deterioration in the regret is unavoidable in this case. Finally, the algorithms we develop for achieving the optimal rates are inherently improper. Nevertheless, we show that for a variety of natural concept classes, such as linear classifiers, it is possible to retain the dependence $ε=Θ_{\mathcal{H}}(η)$ by a proper algorithm in the realizable setting. Here $Θ_{\mathcal{H}}$ conceals a polynomial dependence on $\mathtt{VC}(\mathcal{H})$.
△ Less
Submitted 12 October, 2022; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Adversarially Robust Learning: A Generic Minimax Optimal Learner and Characterization
Authors:
Omar Montasser,
Steve Hanneke,
Nathan Srebro
Abstract:
We present a minimax optimal learner for the problem of learning predictors robust to adversarial examples at test-time. Interestingly, we find that this requires new algorithmic ideas and approaches to adversarially robust learning. In particular, we show, in a strong negative sense, the suboptimality of the robust learner proposed by Montasser, Hanneke, and Srebro (2019) and a broader family of…
▽ More
We present a minimax optimal learner for the problem of learning predictors robust to adversarial examples at test-time. Interestingly, we find that this requires new algorithmic ideas and approaches to adversarially robust learning. In particular, we show, in a strong negative sense, the suboptimality of the robust learner proposed by Montasser, Hanneke, and Srebro (2019) and a broader family of learners we identify as local learners. Our results are enabled by adopting a global perspective, specifically, through a key technical contribution: the global one-inclusion graph, which may be of independent interest, that generalizes the classical one-inclusion graph due to Haussler, Littlestone, and Warmuth (1994). Finally, as a byproduct, we identify a dimension characterizing qualitatively and quantitatively what classes of predictors $\mathcal{H}$ are robustly learnable. This resolves an open problem due to Montasser et al. (2019), and closes a (potentially) infinite gap between the established upper and lower bounds on the sample complexity of adversarially robust learning.
△ Less
Submitted 15 September, 2022;
originally announced September 2022.
-
Fine-Grained Distribution-Dependent Learning Curves
Authors:
Olivier Bousquet,
Steve Hanneke,
Shay Moran,
Jonathan Shafer,
Ilya Tolstikhin
Abstract:
Learning curves plot the expected error of a learning algorithm as a function of the number of labeled samples it receives from a target distribution. They are widely used as a measure of an algorithm's performance, but classic PAC learning theory cannot explain their behavior.
As observed by Antos and Lugosi (1996 , 1998), the classic `No Free Lunch' lower bounds only trace the upper envelope a…
▽ More
Learning curves plot the expected error of a learning algorithm as a function of the number of labeled samples it receives from a target distribution. They are widely used as a measure of an algorithm's performance, but classic PAC learning theory cannot explain their behavior.
As observed by Antos and Lugosi (1996 , 1998), the classic `No Free Lunch' lower bounds only trace the upper envelope above all learning curves of specific target distributions. For a concept class with VC dimension $d$ the classic bound decays like $d/n$, yet it is possible that the learning curve for \emph{every} specific distribution decays exponentially. In this case, for each $n$ there exists a different `hard' distribution requiring $d/n$ samples. Antos and Lugosi asked which concept classes admit a `strong minimax lower bound' -- a lower bound of $d'/n$ that holds for a fixed distribution for infinitely many $n$.
We solve this problem in a principled manner, by introducing a combinatorial dimension called VCL that characterizes the best $d'$ for which $d'/n$ is a strong minimax lower bound. Our characterization strengthens the lower bounds of Bousquet, Hanneke, Moran, van Handel, and Yehudayoff (2021), and it refines their theory of learning curves, by showing that for classes with finite VCL the learning rate can be decomposed into a linear component that depends only on the hypothesis class and an exponential component that depends also on the target distribution. As a corollary, we recover the lower bound of Antos and Lugosi (1996 , 1998) for half-spaces in $\mathbb{R}^d$.
Finally, to provide another viewpoint on our work and how it compares to traditional PAC learning bounds, we also present an alternative formulation of our results in a language that is closer to the PAC setting.
△ Less
Submitted 10 November, 2022; v1 submitted 30 August, 2022;
originally announced August 2022.
-
Adversarially Robust PAC Learnability of Real-Valued Functions
Authors:
Idan Attias,
Steve Hanneke
Abstract:
We study robustness to test-time adversarial attacks in the regression setting with $\ell_p$ losses and arbitrary perturbation sets. We address the question of which function classes are PAC learnable in this setting. We show that classes of finite fat-shattering dimension are learnable in both realizable and agnostic settings. Moreover, for convex function classes, they are even properly learnabl…
▽ More
We study robustness to test-time adversarial attacks in the regression setting with $\ell_p$ losses and arbitrary perturbation sets. We address the question of which function classes are PAC learnable in this setting. We show that classes of finite fat-shattering dimension are learnable in both realizable and agnostic settings. Moreover, for convex function classes, they are even properly learnable. In contrast, some non-convex function classes provably require improper learning algorithms. Our main technique is based on a construction of an adversarially robust sample compression scheme of a size determined by the fat-shattering dimension. Along the way, we introduce a novel agnostic sample compression scheme for real-valued functions, which may be of independent interest.
△ Less
Submitted 5 May, 2024; v1 submitted 26 June, 2022;
originally announced June 2022.
-
Universally Consistent Online Learning with Arbitrarily Dependent Responses
Authors:
Steve Hanneke
Abstract:
This work provides an online learning rule that is universally consistent under processes on (X,Y) pairs, under conditions only on the X process. As a special case, the conditions admit all processes on (X,Y) such that the process on X is stationary. This generalizes past results which required stationarity for the joint process on (X,Y), and additionally required this process to be ergodic. In pa…
▽ More
This work provides an online learning rule that is universally consistent under processes on (X,Y) pairs, under conditions only on the X process. As a special case, the conditions admit all processes on (X,Y) such that the process on X is stationary. This generalizes past results which required stationarity for the joint process on (X,Y), and additionally required this process to be ergodic. In particular, this means that ergodicity is superfluous for the purpose of universally consistent online learning.
△ Less
Submitted 11 March, 2022;
originally announced March 2022.
-
Robustly-reliable learners under poisoning attacks
Authors:
Maria-Florina Balcan,
Avrim Blum,
Steve Hanneke,
Dravyansh Sharma
Abstract:
Data poisoning attacks, in which an adversary corrupts a training set with the goal of inducing specific desired mistakes, have raised substantial concern: even just the possibility of such an attack can make a user no longer trust the results of a learning system. In this work, we show how to achieve strong robustness guarantees in the face of such attacks across multiple axes.
We provide robus…
▽ More
Data poisoning attacks, in which an adversary corrupts a training set with the goal of inducing specific desired mistakes, have raised substantial concern: even just the possibility of such an attack can make a user no longer trust the results of a learning system. In this work, we show how to achieve strong robustness guarantees in the face of such attacks across multiple axes.
We provide robustly-reliable predictions, in which the predicted label is guaranteed to be correct so long as the adversary has not exceeded a given corruption budget, even in the presence of instance targeted attacks, where the adversary knows the test example in advance and aims to cause a specific failure on that example. Our guarantees are substantially stronger than those in prior approaches, which were only able to provide certificates that the prediction of the learning algorithm does not change, as opposed to certifying that the prediction is correct, as we are able to achieve in our work. Remarkably, we provide a complete characterization of learnability in this setting, in particular, nearly-tight matching upper and lower bounds on the region that can be certified, as well as efficient algorithms for computing this region given an ERM oracle. Moreover, for the case of linear separators over logconcave distributions, we provide efficient truly polynomial time algorithms (i.e., non-oracle algorithms) for such robustly-reliable predictions.
We also extend these results to the active setting where the algorithm adaptively asks for labels of specific informative examples, and the difficulty is that the adversary might even be adaptive to this interaction, as well as to the agnostic learning setting where there is no perfect classifier even over the uncorrupted data.
△ Less
Submitted 8 March, 2022;
originally announced March 2022.
-
A Characterization of Semi-Supervised Adversarially-Robust PAC Learnability
Authors:
Idan Attias,
Steve Hanneke,
Yishay Mansour
Abstract:
We study the problem of learning an adversarially robust predictor to test time attacks in the semi-supervised PAC model. We address the question of how many labeled and unlabeled examples are required to ensure learning. We show that having enough unlabeled data (the size of a labeled sample that a fully-supervised method would require), the labeled sample complexity can be arbitrarily smaller co…
▽ More
We study the problem of learning an adversarially robust predictor to test time attacks in the semi-supervised PAC model. We address the question of how many labeled and unlabeled examples are required to ensure learning. We show that having enough unlabeled data (the size of a labeled sample that a fully-supervised method would require), the labeled sample complexity can be arbitrarily smaller compared to previous works, and is sharply characterized by a different complexity measure. We prove nearly matching upper and lower bounds on this sample complexity. This shows that there is a significant benefit in semi-supervised robust learning even in the worst-case distribution-free model, and establishes a gap between the supervised and semi-supervised label complexities which is known not to hold in standard non-robust PAC learning.
△ Less
Submitted 5 May, 2024; v1 submitted 10 February, 2022;
originally announced February 2022.
-
Universal Online Learning with Unbounded Losses: Memory Is All You Need
Authors:
Moise Blanchard,
Romain Cosson,
Steve Hanneke
Abstract:
We resolve an open problem of Hanneke on the subject of universally consistent online learning with non-i.i.d. processes and unbounded losses. The notion of an optimistically universal learning rule was defined by Hanneke in an effort to study learning theory under minimal assumptions. A given learning rule is said to be optimistically universal if it achieves a low long-run average loss whenever…
▽ More
We resolve an open problem of Hanneke on the subject of universally consistent online learning with non-i.i.d. processes and unbounded losses. The notion of an optimistically universal learning rule was defined by Hanneke in an effort to study learning theory under minimal assumptions. A given learning rule is said to be optimistically universal if it achieves a low long-run average loss whenever the data generating process makes this goal achievable by some learning rule. Hanneke posed as an open problem whether, for every unbounded loss, the family of processes admitting universal learning are precisely those having a finite number of distinct values almost surely. In this paper, we completely resolve this problem, showing that this is indeed the case. As a consequence, this also offers a dramatically simpler formulation of an optimistically universal learning rule for any unbounded loss: namely, the simple memorization rule already suffices. Our proof relies on constructing random measurable partitions of the instance space and could be of independent interest for solving other open questions. We extend the results to the non-realizable setting thereby providing an optimistically universal Bayes consistent learning rule.
△ Less
Submitted 21 January, 2022;
originally announced January 2022.
-
Transductive Robust Learning Guarantees
Authors:
Omar Montasser,
Steve Hanneke,
Nathan Srebro
Abstract:
We study the problem of adversarially robust learning in the transductive setting. For classes $\mathcal{H}$ of bounded VC dimension, we propose a simple transductive learner that when presented with a set of labeled training examples and a set of unlabeled test examples (both sets possibly adversarially perturbed), it correctly labels the test examples with a robust error rate that is linear in t…
▽ More
We study the problem of adversarially robust learning in the transductive setting. For classes $\mathcal{H}$ of bounded VC dimension, we propose a simple transductive learner that when presented with a set of labeled training examples and a set of unlabeled test examples (both sets possibly adversarially perturbed), it correctly labels the test examples with a robust error rate that is linear in the VC dimension and is adaptive to the complexity of the perturbation set. This result provides an exponential improvement in dependence on VC dimension over the best known upper bound on the robust error in the inductive setting, at the expense of competing with a more restrictive notion of optimal robust error.
△ Less
Submitted 20 October, 2021;
originally announced October 2021.
-
Open Problem: Is There an Online Learning Algorithm That Learns Whenever Online Learning Is Possible?
Authors:
Steve Hanneke
Abstract:
This open problem asks whether there exists an online learning algorithm for binary classification that guarantees, for all target concepts, to make a sublinear number of mistakes, under only the assumption that the (possibly random) sequence of points X allows that such a learning algorithm can exist for that sequence. As a secondary problem, it also asks whether a specific concise condition comp…
▽ More
This open problem asks whether there exists an online learning algorithm for binary classification that guarantees, for all target concepts, to make a sublinear number of mistakes, under only the assumption that the (possibly random) sequence of points X allows that such a learning algorithm can exist for that sequence. As a secondary problem, it also asks whether a specific concise condition completely determines whether a given (possibly random) sequence of points X admits the existence of online learning algorithms guaranteeing a sublinear number of mistakes for all target concepts.
△ Less
Submitted 20 July, 2021;
originally announced July 2021.
-
A Theory of PAC Learnability of Partial Concept Classes
Authors:
Noga Alon,
Steve Hanneke,
Ron Holzman,
Shay Moran
Abstract:
We extend the theory of PAC learning in a way which allows to model a rich variety of learning tasks where the data satisfy special properties that ease the learning process. For example, tasks where the distance of the data from the decision boundary is bounded away from zero. The basic and simple idea is to consider partial concepts: these are functions that can be undefined on certain parts of…
▽ More
We extend the theory of PAC learning in a way which allows to model a rich variety of learning tasks where the data satisfy special properties that ease the learning process. For example, tasks where the distance of the data from the decision boundary is bounded away from zero. The basic and simple idea is to consider partial concepts: these are functions that can be undefined on certain parts of the space. When learning a partial concept, we assume that the source distribution is supported only on points where the partial concept is defined.
This way, one can naturally express assumptions on the data such as lying on a lower dimensional surface or margin conditions. In contrast, it is not at all clear that such assumptions can be expressed by the traditional PAC theory. In fact we exhibit easy-to-learn partial concept classes which provably cannot be captured by the traditional PAC theory. This also resolves a question posed by Attias, Kontorovich, and Mansour 2019.
We characterize PAC learnability of partial concept classes and reveal an algorithmic landscape which is fundamentally different than the classical one. For example, in the classical PAC model, learning boils down to Empirical Risk Minimization (ERM). In stark contrast, we show that the ERM principle fails in explaining learnability of partial concept classes. In fact, we demonstrate classes that are incredibly easy to learn, but such that any algorithm that learns them must use an hypothesis space with unbounded VC dimension. We also find that the sample compression conjecture fails in this setting.
Thus, this theory features problems that cannot be represented nor solved in the traditional way. We view this as evidence that it might provide insights on the nature of learnability in realistic scenarios which the classical theory fails to explain.
△ Less
Submitted 20 July, 2021; v1 submitted 18 July, 2021;
originally announced July 2021.
-
Robust learning under clean-label attack
Authors:
Avrim Blum,
Steve Hanneke,
Jian Qian,
Han Shao
Abstract:
We study the problem of robust learning under clean-label data-poisoning attacks, where the attacker injects (an arbitrary set of) correctly-labeled examples to the training set to fool the algorithm into making mistakes on specific test instances at test time. The learning goal is to minimize the attackable rate (the probability mass of attackable test instances), which is more difficult than opt…
▽ More
We study the problem of robust learning under clean-label data-poisoning attacks, where the attacker injects (an arbitrary set of) correctly-labeled examples to the training set to fool the algorithm into making mistakes on specific test instances at test time. The learning goal is to minimize the attackable rate (the probability mass of attackable test instances), which is more difficult than optimal PAC learning. As we show, any robust algorithm with diminishing attackable rate can achieve the optimal dependence on $ε$ in its PAC sample complexity, i.e., $O(1/ε)$. On the other hand, the attackable rate might be large even for some optimal PAC learners, e.g., SVM for linear classifiers. Furthermore, we show that the class of linear hypotheses is not robustly learnable when the data distribution has zero margin and is robustly learnable in the case of positive margin but requires sample complexity exponential in the dimension. For a general hypothesis class with bounded VC dimension, if the attacker is limited to add at most $t>0$ poison examples, the optimal robust learning sample complexity grows almost linearly with $t$.
△ Less
Submitted 6 July, 2021; v1 submitted 28 February, 2021;
originally announced March 2021.
-
Adversarially Robust Learning with Unknown Perturbation Sets
Authors:
Omar Montasser,
Steve Hanneke,
Nathan Srebro
Abstract:
We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of succ…
▽ More
We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of successful attacks, in different interaction models, in terms of the VC and Littlestone dimensions of the hypothesis class of predictors, and without any assumptions on the perturbation set.
△ Less
Submitted 3 February, 2021;
originally announced February 2021.
-
Online Learning with Simple Predictors and a Combinatorial Characterization of Minimax in 0/1 Games
Authors:
Steve Hanneke,
Roi Livni,
Shay Moran
Abstract:
Which classes can be learned properly in the online model? -- that is, by an algorithm that at each round uses a predictor from the concept class. While there are simple and natural cases where improper learning is necessary, it is natural to ask how complex must the improper predictors be in such cases. Can one always achieve nearly optimal mistake/regret bounds using "simple" predictors?
In th…
▽ More
Which classes can be learned properly in the online model? -- that is, by an algorithm that at each round uses a predictor from the concept class. While there are simple and natural cases where improper learning is necessary, it is natural to ask how complex must the improper predictors be in such cases. Can one always achieve nearly optimal mistake/regret bounds using "simple" predictors?
In this work, we give a complete characterization of when this is possible, thus settling an open problem which has been studied since the pioneering works of Angluin (1987) and Littlestone (1988). More precisely, given any concept class C and any hypothesis class H, we provide nearly tight bounds (up to a log factor) on the optimal mistake bounds for online learning C using predictors from H. Our bound yields an exponential improvement over the previously best known bound by Chase and Freitag (2020).
As applications, we give constructive proofs showing that (i) in the realizable setting, a near-optimal mistake bound (up to a constant factor) can be attained by a sparse majority-vote of proper predictors, and (ii) in the agnostic setting, a near-optimal regret bound (up to a log factor) can be attained by a randomized proper algorithm.
A technical ingredient of our proof which may be of independent interest is a generalization of the celebrated Minimax Theorem (von Neumann, 1928) for binary zero-sum games. A simple game which fails to satisfy Minimax is "Guess the Larger Number", where each player picks a number and the larger number wins. The payoff matrix is infinite triangular. We show this is the only obstruction: if a game does not contain triangular submatrices of unbounded sizes then the Minimax Theorem holds. This generalizes von Neumann's Minimax Theorem by removing requirements of finiteness (or compactness), and captures precisely the games of interest in online learning.
△ Less
Submitted 2 February, 2021;
originally announced February 2021.
-
Stable Sample Compression Schemes: New Applications and an Optimal SVM Margin Bound
Authors:
Steve Hanneke,
Aryeh Kontorovich
Abstract:
We analyze a family of supervised learning algorithms based on sample compression schemes that are stable, in the sense that removing points from the training set which were not selected for the compression set does not alter the resulting classifier. We use this technique to derive a variety of novel or improved data-dependent generalization bounds for several learning algorithms. In particular,…
▽ More
We analyze a family of supervised learning algorithms based on sample compression schemes that are stable, in the sense that removing points from the training set which were not selected for the compression set does not alter the resulting classifier. We use this technique to derive a variety of novel or improved data-dependent generalization bounds for several learning algorithms. In particular, we prove a new margin bound for SVM, removing a log factor. The new bound is provably optimal. This resolves a long-standing open question about the PAC margin bounds achievable by SVM.
△ Less
Submitted 9 November, 2020;
originally announced November 2020.
-
A Theory of Universal Learning
Authors:
Olivier Bousquet,
Steve Hanneke,
Shay Moran,
Ramon van Handel,
Amir Yehudayoff
Abstract:
How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, d…
▽ More
How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy.
In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay either at an exponential, linear, or arbitrarily slow rates. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case.
For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.
△ Less
Submitted 9 November, 2020;
originally announced November 2020.
-
Reducing Adversarially Robust Learning to Non-Robust PAC Learning
Authors:
Omar Montasser,
Steve Hanneke,
Nathan Srebro
Abstract:
We study the problem of reducing adversarially robust learning to standard PAC learning, i.e. the complexity of learning adversarially robust predictors using access to only a black-box non-robust learner. We give a reduction that can robustly learn any hypothesis class $\mathcal{C}$ using any non-robust learner $\mathcal{A}$ for $\mathcal{C}$. The number of calls to $\mathcal{A}$ depends logarith…
▽ More
We study the problem of reducing adversarially robust learning to standard PAC learning, i.e. the complexity of learning adversarially robust predictors using access to only a black-box non-robust learner. We give a reduction that can robustly learn any hypothesis class $\mathcal{C}$ using any non-robust learner $\mathcal{A}$ for $\mathcal{C}$. The number of calls to $\mathcal{A}$ depends logarithmically on the number of allowed adversarial perturbations per example, and we give a lower bound showing this is unavoidable.
△ Less
Submitted 22 October, 2020;
originally announced October 2020.
-
A No-Free-Lunch Theorem for MultiTask Learning
Authors:
Steve Hanneke,
Samory Kpotufe
Abstract:
Multitask learning and related areas such as multi-source domain adaptation address modern settings where datasets from $N$ related distributions $\{P_t\}$ are to be combined towards improving performance on any single such distribution ${\cal D}$. A perplexing fact remains in the evolving theory on the subject: while we would hope for performance bounds that account for the contribution from mult…
▽ More
Multitask learning and related areas such as multi-source domain adaptation address modern settings where datasets from $N$ related distributions $\{P_t\}$ are to be combined towards improving performance on any single such distribution ${\cal D}$. A perplexing fact remains in the evolving theory on the subject: while we would hope for performance bounds that account for the contribution from multiple tasks, the vast majority of analyses result in bounds that improve at best in the number $n$ of samples per task, but most often do not improve in $N$. As such, it might seem at first that the distributional settings or aggregation procedures considered in such analyses might be somehow unfavorable; however, as we show, the picture happens to be more nuanced, with interestingly hard regimes that might appear otherwise favorable.
In particular, we consider a seemingly favorable classification scenario where all tasks $P_t$ share a common optimal classifier $h^*,$ and which can be shown to admit a broad range of regimes with improved oracle rates in terms of $N$ and $n$. Some of our main results are as follows:
$\bullet$ We show that, even though such regimes admit minimax rates accounting for both $n$ and $N$, no adaptive algorithm exists; that is, without access to distributional information, no algorithm can guarantee rates that improve with large $N$ for $n$ fixed.
$\bullet$ With a bit of additional information, namely, a ranking of tasks $\{P_t\}$ according to their distance to a target ${\cal D}$, a simple rank-based procedure can achieve near optimal aggregations of tasks' datasets, despite a search space exponential in $N$. Interestingly, the optimal aggregation might exclude certain tasks, even though they all share the same $h^*$.
△ Less
Submitted 5 August, 2020; v1 submitted 28 June, 2020;
originally announced June 2020.