
The Empirical Mean is Minimax Optimal for Local Glivenko-Cantelli

Authors: Doron Cohen, Aryeh Kontorovich, Roi Weiss

Abstract: We revisit the recently introduced Local Glivenko-Cantelli setting, which studies distribution-dependent uniform convergence rates of the Empirical Mean Estimator (EME). In this work, we investigate generalizations of this setting where arbitrary estimators are allowed rather than just the EME. Can a strictly larger class of measures be learned? Can better risk decay rates be obtained? We provide exhaustive answers to these questions, which are both negative, provided the learner is barred from exploiting some infinite-dimensional pathologies. On the other hand, allowing such exploits does lead to a strictly larger class of learnable measures.

Submitted 28 May, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

arXiv:2407.16642 [pdf, ps, other]

Sharp bounds on aggregate expert error

Authors: Aryeh Kontorovich, Ariel Avital

Abstract: We revisit the classic problem of aggregating binary advice from conditionally independent experts, also known as the Naive Bayes setting. Our quantity of interest is the error probability of the optimal decision rule. In the case of symmetric errors (sensitivity = specificity), reasonably tight bounds on the optimal error probability are known. In the general asymmetric case, we are not aware of any nontrivial estimates on this quantity. Our contribution consists of sharp upper and lower bounds on the optimal error probability in the general case, which recover and sharpen the best known results in the symmetric special case. Since this turns out to be equivalent to estimating the total variation distance between two product distributions, our results also have bearing on this important and challenging problem.

Submitted 23 December, 2024; v1 submitted 23 July, 2024; originally announced July 2024.
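
The equivalence mentioned in the abstract above can be illustrated with a small brute-force sketch. It assumes a uniform prior on the binary label (an assumption made here for illustration; the abstract does not state the prior), in which case the optimal error equals (1 - TV)/2, where TV is the total variation distance between the two product distributions of expert votes. The function names `product_pmf` and `bayes_error_and_tv` are hypothetical and not taken from the paper, and exhaustive enumeration is only feasible for a handful of experts.

```python
from itertools import product


def product_pmf(probs, x):
    """Probability of the binary vector x when coordinate i is Bernoulli(probs[i]), independently."""
    p = 1.0
    for pi, xi in zip(probs, x):
        p *= pi if xi == 1 else 1.0 - pi
    return p


def bayes_error_and_tv(p_given_1, p_given_0):
    """Enumerate all vote vectors for a small number of experts.

    p_given_1[i] is the probability that expert i votes 1 when the label is 1;
    p_given_0[i] is the probability that expert i votes 1 when the label is 0.
    Assumes a uniform prior on the label (illustrative assumption).
    Returns (optimal error probability, total variation distance).
    """
    n = len(p_given_1)
    err = 0.0
    tv = 0.0
    for x in product((0, 1), repeat=n):
        a = product_pmf(p_given_1, x)
        b = product_pmf(p_given_0, x)
        err += 0.5 * min(a, b)   # the optimal rule errs on the less likely label
        tv += 0.5 * abs(a - b)   # TV distance accumulates half the L1 gap
    return err, tv


if __name__ == "__main__":
    err, tv = bayes_error_and_tv([0.9, 0.7, 0.6], [0.2, 0.4, 0.3])
    print(err, (1.0 - tv) / 2.0)  # the two numbers agree up to rounding
```

The enumeration costs 2^n evaluations, which is why closed-form bounds of the kind the abstract describes matter once the number of experts grows.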

arXiv:2309.17016 [pdf, other]

Efficient Agnostic Learning with Average Smoothness

Authors: Steve Hanneke, Aryeh Kontorovich, Guy Kornowski

Abstract: We study distribution-free nonparametric regression following a notion of average smoothness initiated by Ashlagi et al. (2021), which measures the "effective" smoothness of a function with respect to an arbitrary unknown underlying distribution. While the recent work of Hanneke et al. (2023) established tight uniform convergence bounds for average-smooth functions in the realizable case and provided a computationally efficient realizable learning algorithm, both of these results currently lack analogs in the general agnostic (i.e. noisy) case. In this work, we fully close these gaps. First, we provide a distribution-free uniform convergence bound for average-smoothness classes in the agnostic setting. Second, we match the derived sample complexity with a computationally efficient agnostic learning algorithm. Our results, which are stated in terms of the intrinsic geometry of the data and hold over any totally bounded metric space, show that the guarantees recently obtained for realizable learning of average-smooth functions transfer to the agnostic setting. At the heart of our proof, we establish the uniform convergence rate of a function class in terms of its bracketing entropy, which may be of independent interest.

Submitted 13 February, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

Comments: ALT 2024 camera ready version. arXiv admin note: text overlap with arXiv:2302.06005

arXiv:2302.06005 [pdf, other]

Near-optimal learning with average Hölder smoothness

Authors: Steve Hanneke, Aryeh Kontorovich, Guy Kornowski

Abstract: We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness. Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.

Submitted 30 October, 2023; v1 submitted 12 February, 2023; originally announced February 2023.

Comments: NeurIPS 2023 camera ready version

arXiv:2212.04216 [pdf, ps, other]

Differentially-Private Bayes Consistency

Authors: Olivier Bousquet, Haim Kaplan, Aryeh Kontorovich, Yishay Mansour, Shay Moran, Menachem Sadigurschi, Uri Stemmer

Abstract: We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact, learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labeled} sample complexity of $\tilde{O}(d/\varepsilon)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution).

Submitted 8 December, 2022; originally announced December 2022.

arXiv:2202.03045 [pdf, ps, other]

Metric-valued regression

Authors: Dan Tsir Cohen, Aryeh Kontorovich

Abstract: We propose an efficient algorithm for learning mappings between two metric spaces, $\X$ and $\Y$. Our procedure is strongly Bayes-consistent whenever $\X$ and $\Y$ are topologically separable and $\Y$ is "bounded in expectation" (our term; the separability assumption can be somewhat weakened). At this level of generality, ours is the first such learnability result for unbounded loss in the agnostic setting. Our technique is based on metric medoids (a variant of Fréchet means) and presents a significant departure from existing methods, which, as we demonstrate, fail to achieve Bayes-consistency on general instance- and label-space metrics. Our proofs introduce the technique of {\em semi-stable compression}, which may be of independent interest.

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2111.11971 [pdf, ps, other]

Tree density estimation

Authors: László Györfi, Aryeh Kontorovich, Roi Weiss

Abstract: We study the problem of estimating the density $f(\boldsymbol x)$ of a random vector ${\boldsymbol X}$ in $\mathbb R^d$. For a spanning tree $T$ defined on the vertex set $\{1,\dots ,d\}$, the tree density $f_{T}$ is a product of bivariate conditional densities. An optimal spanning tree minimizes the Kullback-Leibler divergence between $f$ and $f_{T}$. From i.i.d. data we identify an optimal tree $T^*$ and efficiently construct a tree density estimate $f_n$ such that, without any regularity conditions on the density $f$, one has $\lim_{n\to \infty} \int |f_n(\boldsymbol x)-f_{T^*}(\boldsymbol x)|d\boldsymbol x=0$ a.s. For Lipschitz $f$ with bounded support, $\mathbb E \left\{ \int |f_n(\boldsymbol x)-f_{T^*}(\boldsymbol x)|d\boldsymbol x\right\}=O\big(n^{-1/4}\big)$, a dimension-free rate.

Submitted 21 September, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

arXiv:2110.04763 [pdf, ps, other]

Fat-Shattering Dimension of $k$-fold Aggregations

Authors: Idan Attias, Aryeh Kontorovich

Abstract: We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the $k$ classes, and computing a pointwise function of them, such as the median, mean, and maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on $k$. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.

Submitted 9 September, 2023; v1 submitted 10 October, 2021; originally announced October 2021.

arXiv:2011.04586 [pdf, ps, other]

Stable Sample Compression Schemes: New Applications and an Optimal SVM Margin Bound

Authors: Steve Hanneke, Aryeh Kontorovich

Abstract: We analyze a family of supervised learning algorithms based on sample compression schemes that are stable, in the sense that removing points from the training set which were not selected for the compression set does not alter the resulting classifier. We use this technique to derive a variety of novel or improved data-dependent generalization bounds for several learning algorithms. In particular, we prove a new margin bound for SVM, removing a log factor. The new bound is provably optimal. This resolves a long-standing open question about the PAC margin bounds achievable by SVM.

Submitted 9 November, 2020; originally announced November 2020.

arXiv:2003.13561 [pdf, other]

On Biased Random Walks, Corrupted Intervals, and Learning Under Adversarial Design

Authors: Daniel Berend, Aryeh Kontorovich, Lev Reyzin, Thomas Robinson

Abstract: We tackle some fundamental problems in probability theory on corrupted random processes on the integer line. We analyze when a biased random walk is expected to reach its bottommost point and when intervals of integer points can be detected under a natural model of noise. We apply these results to problems in learning thresholds and intervals under a new model for learning under adversarial design.

Submitted 30 March, 2020; originally announced March 2020.

Comments: 18 pages

arXiv:2002.01999 [pdf, other]

Nested Barycentric Coordinate System as an Explicit Feature Map

Authors: Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, Gabriel Nivasch, Ofir Pele

Abstract: We propose a new embedding method which is particularly well-suited for settings where the sample size greatly exceeds the ambient dimension. Our technique consists of partitioning the space into simplices and then embedding the data points into features corresponding to the simplices' barycentric coordinates. We then train a linear classifier in the rich feature space obtained from the simplices. The decision boundary may be highly non-linear, though it is linear within each simplex (and hence piecewise-linear overall). Further, our method can approximate any convex body. We give generalization bounds based on empirical margin and a novel hybrid sample compression technique. An extensive empirical evaluation shows that our method consistently outperforms a range of popular kernel embedding methods.

Submitted 5 February, 2020; originally announced February 2020.

arXiv:2002.01408 [pdf, other]

Apportioned Margin Approach for Cost Sensitive Large Margin Classifiers

Authors: Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich

Abstract: We consider the problem of cost sensitive multiclass classification, where we would like to increase the sensitivity of an important class at the expense of a less important one. We adopt an {\em apportioned margin} framework to address this problem, which enables an efficient margin shift between classes that share the same boundary. The decision boundary between all pairs of classes divides the margin between them in accordance with a given prioritization vector, which yields a tighter error bound for the important classes while also reducing the overall out-of-sample error. In addition to demonstrating an efficient implementation of our framework, we derive generalization bounds, demonstrate Fisher consistency, adapt the framework to Mercer's kernel and to neural networks, and report promising empirical results on all accounts.

Submitted 4 February, 2020; originally announced February 2020.

arXiv:1910.05270 [pdf, ps, other]

Fast and Bayes-consistent nearest neighbors

Authors: Klim Efremenko, Aryeh Kontorovich, Moshe Noivirt

Abstract: Research on nearest-neighbor methods tends to focus somewhat dichotomously either on the statistical or the computational aspects -- either on, say, Bayes consistency and rates of convergence or on techniques for speeding up the proximity search. This paper aims at bridging these realms: to reap the advantages of fast evaluation time while maintaining Bayes consistency, and further without sacrificing too much in the risk decay rate. We combine the locality-sensitive hashing (LSH) technique with a novel missing-mass argument to obtain a fast and Bayes-consistent classifier. Our algorithm's prediction runtime compares favorably against state of the art approximate NN methods, while maintaining Bayes-consistency and attaining rates comparable to minimax. On samples of size $n$ in $\R^d$, our pre-processing phase has runtime $O(d n \log n)$, while the evaluation phase has runtime $O(d\log n)$ per query point.

Submitted 15 April, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

arXiv:1906.09855 [pdf, other]

Universal Bayes consistency in metric spaces

Authors: Steve Hanneke, Aryeh Kontorovich, Sivan Sabato, Roi Weiss

Abstract: We extend a recently proposed 1-nearest-neighbor based multiclass learning algorithm and prove that our modification is universally strongly Bayes-consistent in all metric spaces admitting any such learner, making it an "optimistically universal" Bayes-consistent learner. This is the first learning algorithm known to enjoy this property; by comparison, the $k$-NN classifier and its variants are not generally universally Bayes-consistent, except under additional structural assumptions, such as an inner product, a norm, finite dimension, or a Besicovitch-type property. The metric spaces in which universal Bayes consistency is possible are the "essentially separable" ones -- a notion that we define, which is more general than standard separability. The existence of metric spaces that are not essentially separable is widely believed to be independent of the ZFC axioms of set theory. We prove that essential separability exactly characterizes the existence of a universal Bayes-consistent learner for the given metric space. In particular, this yields the first impossibility result for universal Bayes consistency. Taken together, our results completely characterize strong and weak universal Bayes consistency in metric spaces.

Submitted 6 January, 2021; v1 submitted 24 June, 2019; originally announced June 2019.

Comments: To appear in Annals of Statistics

Journal ref: Annals of Statistics 2021, Vol. 49, No. 4, 2129-2150, August 2021

arXiv:1905.11930 [pdf, other]

Efficient Kirszbraun Extension with Applications to Regression

Authors: Hanan Zaichyk, Armin Biess, Aryeh Kontorovich, Yury Makarychev

Abstract: We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun's extension theorem; to the best of our knowledge, this is the first application of this technique to supervised learning. We analyze the statistical and computational aspects of this method. We decompose this task into two stages: training (which corresponds operationally to smoothing/regularization) and prediction (which is achieved via Kirszbraun extension). Both are solved algorithmically via a novel multiplicative weight updates (MWU) scheme, which, for our problem formulation, achieves a quadratic runtime improvement over the state of the art. Our empirical results indicate a dramatic improvement over standard off-the-shelf solvers in our setting.

Submitted 8 March, 2022; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1902.01224 [pdf, other]

Estimating the Mixing Time of Ergodic Markov Chains

Authors: Geoffrey Wolfer, Aryeh Kontorovich

Abstract: We address the problem of estimating the mixing time $t_{\mathsf{mix}}$ of an arbitrary ergodic finite-state Markov chain from a single trajectory of length $m$. The reversible case was addressed by Hsu et al. [2019], who left the general case as an open problem. In the reversible case, the analysis is greatly facilitated by the fact that the Markov operator is self-adjoint, and Weyl's inequality allows for a dimension-free perturbation analysis of the empirical eigenvalues. As Hsu et al. point out, in the absence of reversibility (which induces asymmetric pair probabilities matrices), the existing perturbation analysis has a worst-case exponential dependence on the number of states $d$. Furthermore, even if an eigenvalue perturbation analysis with better dependence on $d$ were available, in the non-reversible case the connection between the spectral gap and the mixing time is not nearly as straightforward as in the reversible case. Our key insight is to estimate the pseudo-spectral gap $\gamma_{\mathsf{ps}}$ instead, which allows us to overcome the loss of symmetry and to achieve a polynomial dependence on the minimal stationary probability $\pi_\star$ and $\gamma_{\mathsf{ps}}$. Additionally, in the reversible case, we obtain simultaneous nearly (up to logarithmic factors) minimax rates in $t_{\mathsf{mix}}$ and precision $\varepsilon$, closing a gap in Hsu et al., who treated $\varepsilon$ as constant in the lower bounds. Finally, we construct fully empirical confidence intervals for $\gamma_{\mathsf{ps}}$, which shrink to zero at a rate of roughly $1/\sqrt{m}$, and improve the state of the art in even the reversible case.

Submitted 16 August, 2022; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: COLT'19 conference manuscript, with minor fixes

arXiv:1902.00080 [pdf, ps, other]

Minimax Testing of Identity to a Reference Ergodic Markov Chain

Authors: Geoffrey Wolfer, Aryeh Kontorovich

Abstract: We exhibit an efficient procedure for testing, based on a single long state sequence, whether an unknown Markov chain is identical to or $\varepsilon$-far from a given reference chain. We obtain nearly matching (up to logarithmic factors) upper and lower sample complexity bounds for our notion of distance, which is based on total variation. Perhaps surprisingly, we discover that the sample complexity depends solely on the properties of the known reference chain and does not involve the unknown chain at all, which is not even assumed to be ergodic.

Submitted 24 September, 2019; v1 submitted 31 January, 2019; originally announced February 2019.

Comments: A previous version of this print contained a mistake in a proof. We have now fixed it

arXiv:1810.02180 [pdf, other]

Improved Generalization Bounds for Adversarially Robust Learning

Authors: Idan Attias, Aryeh Kontorovich, Yishay Mansour

Abstract: We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\varepsilon^4}\log(\frac{|H|}{\delta}))$ to $O\big(\frac{1}{\varepsilon^2}(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta}))\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.

Submitted 1 July, 2022; v1 submitted 4 October, 2018; originally announced October 2018.

Comments: JMLR camera ready

arXiv:1810.01864 [pdf, other]

Agnostic Sample Compression Schemes for Regression

Authors: Idan Attias, Steve Hanneke, Aryeh Kontorovich, Menachem Sadigurschi

Abstract: We obtain the first positive results for bounded sample compression in the agnostic regression setting with the $\ell_p$ loss, where $p\in [1,\infty]$. We construct a generic approximate sample compression scheme for real-valued function classes exhibiting exponential size in the fat-shattering dimension but independent of the sample size. Notably, for linear regression, an approximate compression of size linear in the dimension is constructed. Moreover, for $\ell_1$ and $\ell_\infty$ losses, we can even exhibit an efficient exact sample compression scheme of size linear in the dimension. We further show that for every other $\ell_p$ loss, $p\in (1,\infty)$, there does not exist an exact agnostic compression scheme of bounded size. This refines and generalizes a negative result of David, Moran, and Yehudayoff for the $\ell_2$ loss. We close by posing general open questions: for agnostic regression with $\ell_1$ loss, does every function class admit an exact compression scheme of size equal to its pseudo-dimension? For the $\ell_2$ loss, does every function class admit an approximate compression scheme of polynomial size in the fat-shattering dimension? These questions generalize Warmuth's classic sample compression conjecture for realizable-case classification.

Submitted 3 February, 2024; v1 submitted 3 October, 2018; originally announced October 2018.

Comments: New results in this version: (1) Approximate agnostic sample compression scheme for function classes with finite fat-shattering dimension and the $\ell_p$ loss (section 3), (2) Near-optimal approximate compression for linear functions and the $\ell_p$ loss (section 4.1). The results in sections 4.2 and 4.3 appear in the previous version.

arXiv:1809.05014 [pdf, other]

Statistical Estimation of Ergodic Markov Chain Kernel over Discrete State Space

Authors: Geoffrey Wolfer, Aryeh Kontorovich

Abstract: We investigate the statistical complexity of estimating the parameters of a discrete-state Markov chain kernel from a single long sequence of state observations. In the finite case, we characterize (modulo logarithmic factors) the minimax sample complexity of estimation with respect to the operator infinity norm, while in the countably infinite case, we analyze the problem with respect to a natural entry-wise norm derived from total variation. We show that in both cases, the sample complexity is governed by the mixing properties of the unknown chain, for which, in the finite-state case, there are known finite-sample estimators with fully empirical confidence intervals.

Submitted 13 August, 2020; v1 submitted 13 September, 2018; originally announced September 2018.

Comments: Journal version of the extended abstract (ALT'19), to appear in Bernoulli 2020+

arXiv:1805.09719 [pdf, other]

Learning convex polyhedra with margin

Authors: Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, Gabriel Nivasch

Abstract: We present an improved algorithm for {\em quasi-properly} learning convex polyhedra in the realizable PAC setting from data with a margin. Our learning algorithm constructs a consistent polyhedron as an intersection of about $t \log t$ halfspaces with constant-size margins in time polynomial in $t$ (where $t$ is the number of halfspaces forming an optimal polyhedron). We also identify distinct generalizations of the notion of margin from hyperplanes to polyhedra and investigate how they relate geometrically; this result may have ramifications beyond the learning setting.

Submitted 2 November, 2021; v1 submitted 24 May, 2018; originally announced May 2018.

arXiv:1805.08254 [pdf, ps, other]

Sample Compression for Real-Valued Learners

Authors: Steve Hanneke, Aryeh Kontorovich, Menachem Sadigurschi

Abstract: We give an algorithmically efficient version of the learner-to-compression scheme conversion in Moran and Yehudayoff (2016). In extending this technique to real-valued hypotheses, we also obtain an efficient regression-to-bounded sample compression converter. To our knowledge, this is the first general compressed regression result (regardless of efficiency or boundedness) guaranteeing uniform approximate reconstruction. Along the way, we develop a generic procedure for constructing weak real-valued learners out of abstract regressors; this may be of independent interest. In particular, this result sheds new light on an open question of H. Simon (1997). We show applications to two regression problems: learning Lipschitz and bounded-variation functions.

Submitted 21 May, 2018; originally announced May 2018.

arXiv:1805.08140 [pdf, ps, other]

A New Lower Bound for Agnostic Learning with Sample Compression Schemes

Authors: Steve Hanneke, Aryeh Kontorovich

Abstract: We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning with classes of VC dimension $k$, where the optimal rates are of the form $\sqrt{\frac{k}{n}}$.

Submitted 21 May, 2018; originally announced May 2018.

arXiv:1708.07367 [pdf, ps, other]

Mixing time estimation in reversible Markov chains from a single sample path

Authors: Daniel Hsu, Aryeh Kontorovich, David A. Levin, Yuval Peres, Csaba Szepesvári

Abstract: The spectral gap $γ$ of a finite, ergodic, and reversible Markov chain is an important parameter measuring the asymptotic rate of convergence. In applications, the transition matrix $P$ may be unknown, yet one sample of the chain up to a fixed time $n$ may be observed. We consider here the problem of estimating $γ$ from this data. Let $π$ be the stationary distribution of $P$, and $π_\star = \min_x π(x)$. We show that if $n = \tilde{O}\bigl(\frac{1}{γπ_\star}\bigr)$, then $γ$ can be estimated to within multiplicative constants with high probability. When $π$ is uniform on $d$ states, this matches (up to logarithmic correction) a lower bound of $\tilde{Ω}\bigl(\frac{d}{γ}\bigr)$ steps required for precise estimation of $γ$. Moreover, we provide the first procedure for computing a fully data-dependent interval, from a single finite-length trajectory of the chain, that traps the mixing time $t_{\text{mix}}$ of the chain at a prescribed confidence level. The interval does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time $t_{\text{relax}} = 1/γ$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $1/\sqrt{n}$ rate, where $n$ is the length of the sample path.

Submitted 24 August, 2017; originally announced August 2017.

Comments: 34 pages, merges results of arXiv:1506.02903 and arXiv:1612.05330

arXiv:1506.02903 [pdf, ps, other]

Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path

Authors: Daniel Hsu, Aryeh Kontorovich, Csaba Szepesvári

Abstract: This article provides the first procedure for computing a fully data-dependent interval that traps the mixing time $t_{\text{mix}}$ of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time $t_{\text{relax}}$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $\sqrt{n}$ rate, where $n$ is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least $Ω(t_{\text{relax}})$ times on the average. Finally, future directions of research are identified.

Submitted 2 November, 2015; v1 submitted 9 June, 2015; originally announced June 2015.

Comments: 28 pages; minor clarification in Appendix A concerning lower bounds

arXiv:1407.0208 [pdf, ps, other]

A Bayes consistent 1-NN classifier

Authors: Aryeh Kontorovich, Roi Weiss

Abstract: We show that a simple modification of the 1-nearest neighbor classifier yields a strongly Bayes consistent learner. Prior to this work, the only strongly Bayes consistent proximity-based method was the k-nearest neighbor classifier, for k growing appropriately with sample size. We will argue that a margin-regularized 1-NN enjoys considerable statistical and algorithmic advantages over the k-NN classifier. These include user-friendly finite-sample error bounds, as well as time- and memory-efficient learning and test-point evaluation algorithms with a principled speed-accuracy tradeoff. Encouraging empirical results are reported.

Submitted 17 August, 2018; v1 submitted 1 July, 2014; originally announced July 2014.

arXiv:1312.0451 [pdf, ps, other]

Consistency of weighted majority votes

Authors: Daniel Berend, Aryeh Kontorovich

Abstract: We revisit the classical decision-theoretic problem of weighted expert voting from a statistical learning perspective. In particular, we examine the consistency (both asymptotic and finitary) of the optimal Nitzan-Paroush weighted majority and related rules. In the case of known expert competence levels, we give sharp error estimates for the optimal rule. When the competence levels are unknown, they must be empirically estimated. We provide frequentist and Bayesian analyses for this situation. Some of our proof techniques are non-standard and may be of independent interest. The bounds we derive are nearly optimal, and several challenging open problems are posed. Experimental results are provided to illustrate the theory.

Submitted 21 January, 2014; v1 submitted 2 December, 2013; originally announced December 2013.

MSC Class: 60C05; 60F15

arXiv:1309.4859 [pdf, ps, other]

Predictive PAC Learning and Process Decompositions

Authors: Cosma Rohilla Shalizi, Aryeh Kontorovich

Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular ($β$-mixing) processes, of independent probability-theoretic interest.

Submitted 19 September, 2013; originally announced September 2013.

Comments: 9 pages, accepted in NIPS 2013

Journal ref: Advances in Neural Information Processing Systems 26 [NIPS 2013], pp.1619--1627

arXiv:1306.2547 [pdf, ps, other]

Efficient Classification for Metric Data

Authors: Lee-Ad Gottlieb, Aryeh Kontorovich, Robert Krauthgamer

Abstract: Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as string edit and earthmover distance. A general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the questions of computational efficiency and of providing direct bounds on generalization error. We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points, and can thus achieve superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is guaranteed via the fat-shattering dimension of Lipschitz classifiers, and we present experimental evidence of its superiority to some common kernel methods. As a by-product, we offer a new perspective on the nearest neighbor classifier, which yields significantly sharper risk asymptotics than the classic analysis of Cover and Hart [IEEE Trans. Info. Theory, 1967].

Submitted 10 July, 2014; v1 submitted 11 June, 2013; originally announced June 2013.

Comments: This is the full version of an extended abstract that appeared in Proceedings of the 23rd COLT, 2010

arXiv:1302.6009 [pdf, ps, other]

On learning parametric-output HMMs

Authors: Aryeh Kontorovich, Boaz Nadler, Roi Weiss

Abstract: We present a novel approach for learning an HMM whose outputs are distributed according to a parametric family. This is done by {\em decoupling} the learning task into two steps: first estimating the output parameters, and then estimating the hidden states transition probabilities. The first step is accomplished by fitting a mixture model to the output stationary distribution. Given the parameters of this mixture model, the second step is formulated as the solution of an easily solvable convex quadratic program. We provide an error analysis for the estimated transition probabilities and show they are robust to small perturbations in the estimates of the mixture parameters. Finally, we support our analysis with some encouraging empirical results.

Submitted 25 February, 2013; originally announced February 2013.

arXiv:1302.2752 [pdf, ps, other]

Adaptive Metric Dimensionality Reduction

Authors: Lee-Ad Gottlieb, Aryeh Kontorovich, Robert Krauthgamer

Abstract: We study adaptive data-dependent dimensionality reduction in the context of supervised learning in general metric spaces. Our main statistical contribution is a generalization bound for Lipschitz functions in metric spaces that are doubling, or nearly doubling. On the algorithmic front, we describe an analogue of PCA for metric spaces: namely an efficient procedure that approximates the data's intrinsic dimension, which is often much lower than the ambient dimension. Our approach thus leverages the dual benefits of low dimensionality: (1) more efficient algorithms, e.g., for proximity search, and (2) more optimistic generalization bounds.

Submitted 25 March, 2015; v1 submitted 12 February, 2013; originally announced February 2013.

Showing 1–31 of 31 results for author: Kontorovich, A