Search | arXiv e-print repository

Uniform Mean Estimation for Heavy-Tailed Distributions via Median-of-Means

Authors: Mikael Møller Høgsgaard, Andrea Paudice

Abstract: The Median of Means (MoM) is a mean estimator that has gained popularity in the context of heavy-tailed data. In this work, we analyze its performance in the task of simultaneously estimating the mean of each function in a class $\mathcal{F}$ when the data distribution possesses only the first $p$ moments for $p \in (1,2]$. We prove a new sample complexity bound using a novel symmetrization techni… ▽ More The Median of Means (MoM) is a mean estimator that has gained popularity in the context of heavy-tailed data. In this work, we analyze its performance in the task of simultaneously estimating the mean of each function in a class $\mathcal{F}$ when the data distribution possesses only the first $p$ moments for $p \in (1,2]$. We prove a new sample complexity bound using a novel symmetrization technique that may be of independent interest. Additionally, we present applications of our result to $k$-means clustering with unbounded inputs and linear regression with general losses, improving upon existing works. △ Less

Submitted 19 June, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

arXiv:2503.09384 [pdf, ps, other]

Revisiting Agnostic Boosting

Authors: Arthur da Cunha, Mikael Møller Høgsgaard, Andrea Paudice, Yuxin Sun

Abstract: Boosting is a key method in statistical learning, allowing for converting weak learners into strong ones. While well studied in the realizable case, the statistical properties of weak-to-strong learning remains less understood in the agnostic setting, where there are no assumptions on the distribution of the labels. In this work, we propose a new agnostic boosting algorithm with substantially impr… ▽ More Boosting is a key method in statistical learning, allowing for converting weak learners into strong ones. While well studied in the realizable case, the statistical properties of weak-to-strong learning remains less understood in the agnostic setting, where there are no assumptions on the distribution of the labels. In this work, we propose a new agnostic boosting algorithm with substantially improved sample complexity compared to prior works under very general assumptions. Our approach is based on a reduction to the realizable case, followed by a margin-based filtering step to select high-quality hypotheses. We conjecture that the error rate achieved by our proposed method is optimal up to logarithmic factors. △ Less

Submitted 12 March, 2025; originally announced March 2025.

arXiv:2502.16462 [pdf, ps, other]

Improved Margin Generalization Bounds for Voting Classifiers

Authors: Mikael Møller Høgsgaard, Kasper Green Larsen

Abstract: In this paper we establish a new margin-based generalization bound for voting classifiers, refining existing results and yielding tighter generalization guarantees for widely used boosting algorithms such as AdaBoost (Freund and Schapire, 1997). Furthermore, the new margin-based generalization bound enables the derivation of an optimal weak-to-strong learner: a Majority-of-3 large-margin classifie… ▽ More In this paper we establish a new margin-based generalization bound for voting classifiers, refining existing results and yielding tighter generalization guarantees for widely used boosting algorithms such as AdaBoost (Freund and Schapire, 1997). Furthermore, the new margin-based generalization bound enables the derivation of an optimal weak-to-strong learner: a Majority-of-3 large-margin classifiers with an expected error matching the theoretical lower bound. This result provides a more natural alternative to the Majority-of-5 algorithm by (Høgsgaard et al., 2024), and matches the Majority-of-3 result by (Aden-Ali et al., 2024) for the realizable prediction model. △ Less

Submitted 3 June, 2025; v1 submitted 23 February, 2025; originally announced February 2025.

arXiv:2502.09496 [pdf, ps, other]

On Agnostic PAC Learning in the Small Error Regime

Authors: Julian Asilis, Mikael Møller Høgsgaard, Grigoris Velegkas

Abstract: Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions $\mathcal{D}$ are simply more difficult to learn than realizable distributions -- even when one discounts a learner's error by… ▽ More Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions $\mathcal{D}$ are simply more difficult to learn than realizable distributions -- even when one discounts a learner's error by $\mathrm{err}(h^*_{\mathcal{D}})$, the error of the best hypothesis in $\mathcal{H}$ for $\mathcal{D}$. Thus, optimal agnostic learners are permitted to incur excess error on (easier-to-learn) distributions $\mathcal{D}$ for which $τ= \mathrm{err}(h^*_{\mathcal{D}})$ is small. Recent work of Hanneke, Larsen, and Zhivotovskiy (FOCS `24) addresses this shortcoming by including $τ$ itself as a parameter in the agnostic error term. In this more fine-grained model, they demonstrate tightness of the error lower bound $τ+ Ω\left(\sqrt{\frac{τ(d + \log(1 / δ))}{m}} + \frac{d + \log(1 / δ)}{m} \right)$ in a regime where $τ> d/m$, and leave open the question of whether there may be a higher lower bound when $τ\approx d/m$, with $d$ denoting $\mathrm{VC}(\mathcal{H})$. In this work, we resolve this question by exhibiting a learner which achieves error $c \cdot τ+ O \left(\sqrt{\frac{τ(d + \log(1 / δ))}{m}} + \frac{d + \log(1 / δ)}{m} \right)$ for a constant $c \leq 2.1$, thus matching the lower bound when $τ\approx d/m$. Further, our learner is computationally efficient and is based upon careful aggregations of ERM classifiers, making progress on two other questions of Hanneke, Larsen, and Zhivotovskiy (FOCS `24). We leave open the interesting question of whether our approach can be refined to lower the constant from 2.1 to 1, which would completely settle the complexity of agnostic learning. △ Less

Submitted 13 February, 2025; originally announced February 2025.

Comments: 44 pages

arXiv:2502.03620 [pdf, other]

Efficient Optimal PAC Learning

Authors: Mikael Møller Høgsgaard

Abstract: Recent advances in the binary classification setting by Hanneke [2016b] and Larsen [2023] have resulted in optimal PAC learners. These learners leverage, respectively, a clever deterministic subsampling scheme and the classic heuristic of bagging Breiman [1996]. Both optimal PAC learners use, as a subroutine, the natural algorithm of empirical risk minimization. Consequently, the computational cos… ▽ More Recent advances in the binary classification setting by Hanneke [2016b] and Larsen [2023] have resulted in optimal PAC learners. These learners leverage, respectively, a clever deterministic subsampling scheme and the classic heuristic of bagging Breiman [1996]. Both optimal PAC learners use, as a subroutine, the natural algorithm of empirical risk minimization. Consequently, the computational cost of these optimal PAC learners is tied to that of the empirical risk minimizer algorithm. In this work, we seek to provide an alternative perspective on the computational cost imposed by the link to the empirical risk minimizer algorithm. To this end, we show the existence of an optimal PAC learner, which offers a different tradeoff in terms of the computational cost induced by the empirical risk minimizer. △ Less

Submitted 7 February, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

arXiv:2410.22749 [pdf, ps, other]

Understanding Aggregations of Proper Learners in Multiclass Classification

Authors: Julian Asilis, Mikael Møller Høgsgaard, Grigoris Velegkas

Abstract: Multiclass learnability is known to exhibit a properness barrier: there are learnable classes which cannot be learned by any proper learner. Binary classification faces no such barrier for learnability, but a similar one for optimal learning, which can in general only be achieved by improper learners. Fortunately, recent advances in binary classification have demonstrated that this requirement can… ▽ More Multiclass learnability is known to exhibit a properness barrier: there are learnable classes which cannot be learned by any proper learner. Binary classification faces no such barrier for learnability, but a similar one for optimal learning, which can in general only be achieved by improper learners. Fortunately, recent advances in binary classification have demonstrated that this requirement can be satisfied using aggregations of proper learners, some of which are strikingly simple. This raises a natural question: to what extent can simple aggregations of proper learners overcome the properness barrier in multiclass classification? We give a positive answer to this question for classes which have finite Graph dimension, $d_G$. Namely, we demonstrate that the optimal binary learners of Hanneke, Larsen, and Aden-Ali et al. (appropriately generalized to the multiclass setting) achieve sample complexity $O\left(\frac{d_G + \ln(1 / δ)}ε\right)$. This forms a strict improvement upon the sample complexity of ERM. We complement this with a lower bound demonstrating that for certain classes of Graph dimension $d_G$, majorities of ERM learners require $Ω\left( \frac{d_G + \ln(1 / δ)}ε\right)$ samples. Furthermore, we show that a single ERM requires $Ω\left(\frac{d_G \ln(1 / ε) + \ln(1 / δ)}ε\right)$ samples on such classes, exceeding the lower bound of Daniely et al. (2015) by a factor of $\ln(1 / ε)$. For multiclass learning in full generality -- i.e., for classes of finite DS dimension but possibly infinite Graph dimension -- we give a strong refutation to these learning strategies, by exhibiting a learnable class which cannot be learned to constant error by any aggregation of a finite number of proper learners. △ Less

Submitted 30 October, 2024; originally announced October 2024.

Comments: 23 pages

arXiv:2408.17148 [pdf, other]

The Many Faces of Optimal Weak-to-Strong Learning

Authors: Mikael Møller Høgsgaard, Kasper Green Larsen, Markus Engelund Mathiasen

Abstract: Boosting is an extremely successful idea, allowing one to combine multiple low accuracy classifiers into a much more accurate voting classifier. In this work, we present a new and surprisingly simple Boosting algorithm that obtains a provably optimal sample complexity. Sample optimal Boosting algorithms have only recently been developed, and our new algorithm has the fastest runtime among all such… ▽ More Boosting is an extremely successful idea, allowing one to combine multiple low accuracy classifiers into a much more accurate voting classifier. In this work, we present a new and surprisingly simple Boosting algorithm that obtains a provably optimal sample complexity. Sample optimal Boosting algorithms have only recently been developed, and our new algorithm has the fastest runtime among all such algorithms and is the simplest to describe: Partition your training data into 5 disjoint pieces of equal size, run AdaBoost on each, and combine the resulting classifiers via a majority vote. In addition to this theoretical contribution, we also perform the first empirical comparison of the proposed sample optimal Boosting algorithms. Our pilot empirical study suggests that our new algorithm might outperform previous algorithms on large data sets. △ Less

Submitted 30 August, 2024; originally announced August 2024.

arXiv:2408.16653 [pdf, other]

Optimal Parallelization of Boosting

Authors: Arthur da Cunha, Mikael Møller Høgsgaard, Kasper Green Larsen

Abstract: Recent works on the parallel complexity of Boosting have established strong lower bounds on the tradeoff between the number of training rounds $p$ and the total parallel work per round $t$. These works have also presented highly non-trivial parallel algorithms that shed light on different regions of this tradeoff. Despite these advancements, a significant gap persists between the theoretical lower… ▽ More Recent works on the parallel complexity of Boosting have established strong lower bounds on the tradeoff between the number of training rounds $p$ and the total parallel work per round $t$. These works have also presented highly non-trivial parallel algorithms that shed light on different regions of this tradeoff. Despite these advancements, a significant gap persists between the theoretical lower bounds and the performance of these algorithms across much of the tradeoff space. In this work, we essentially close this gap by providing both improved lower bounds on the parallel complexity of weak-to-strong learners, and a parallel Boosting algorithm whose performance matches these bounds across the entire $p$ vs.~$t$ compromise spectrum, up to logarithmic factors. Ultimately, this work settles the true parallel complexity of Boosting algorithms that are nearly sample-optimal. △ Less

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2403.08831 [pdf, ps, other]

Majority-of-Three: The Simplest Optimal Learner?

Authors: Ishaq Aden-Ali, Mikael Møller Høgsgaard, Kasper Green Larsen, Nikita Zhivotovskiy

Abstract: Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke's algorithm is quite complex as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets… ▽ More Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke's algorithm is quite complex as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets of the data. It is thus a natural goal to determine the simplest algorithm that is optimal. In this work we study the arguably simplest algorithm that could be optimal: returning the majority vote of three ERM classifiers. We show that this algorithm achieves the optimal in-expectation bound on its error which is provably unattainable by a single ERM classifier. Furthermore, we prove a near-optimal high-probability bound on this algorithm's error. We conjecture that a better analysis will prove that this algorithm is in fact optimal in the high-probability regime. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: 22 pages

arXiv:2302.06165 [pdf, ps, other]

Sparse Dimensionality Reduction Revisited

Authors: Mikael Møller Høgsgaard, Lion Kamma, Kasper Green Larsen, Jelani Nelson, Chris Schwiegelshohn

Abstract: The sparse Johnson-Lindenstrauss transform is one of the central techniques in dimensionality reduction. It supports embedding a set of $n$ points in $\mathbb{R}^d$ into $m=O(\varepsilon^{-2} \lg n)$ dimensions while preserving all pairwise distances to within $1 \pm \varepsilon$. Each input point $x$ is embedded to $Ax$, where $A$ is an $m \times d$ matrix having $s$ non-zeros per column, allowin… ▽ More The sparse Johnson-Lindenstrauss transform is one of the central techniques in dimensionality reduction. It supports embedding a set of $n$ points in $\mathbb{R}^d$ into $m=O(\varepsilon^{-2} \lg n)$ dimensions while preserving all pairwise distances to within $1 \pm \varepsilon$. Each input point $x$ is embedded to $Ax$, where $A$ is an $m \times d$ matrix having $s$ non-zeros per column, allowing for an embedding time of $O(s \|x\|_0)$. Since the sparsity of $A$ governs the embedding time, much work has gone into improving the sparsity $s$. The current state-of-the-art by Kane and Nelson (JACM'14) shows that $s = O(\varepsilon ^{-1} \lg n)$ suffices. This is almost matched by a lower bound of $s = Ω(\varepsilon ^{-1} \lg n/\lg(1/\varepsilon))$ by Nelson and Nguyen (STOC'13). Previous work thus suggests that we have near-optimal embeddings. In this work, we revisit sparse embeddings and identify a loophole in the lower bound. Concretely, it requires $d \geq n$, which in many applications is unrealistic. We exploit this loophole to give a sparser embedding when $d = o(n)$, achieving $s = O(\varepsilon^{-1}(\lg n/\lg(1/\varepsilon)+\lg^{2/3}n \lg^{1/3} d))$. We also complement our analysis by strengthening the lower bound of Nelson and Nguyen to hold also when $d \ll n$, thereby matching the first term in our new sparsity upper bound. Finally, we also improve the sparsity of the best oblivious subspace embeddings for optimal embedding dimensionality. △ Less

Submitted 13 February, 2023; originally announced February 2023.

arXiv:2302.03071 [pdf, ps, other]

Optimally Interpolating between Ex-Ante Fairness and Welfare

Authors: Mikael Møller Høgsgaard, Panagiotis Karras, Wenyue Ma, Nidhi Rathi, Chris Schwiegelshohn

Abstract: For the fundamental problem of allocating a set of resources among individuals with varied preferences, the quality of an allocation relates to the degree of fairness and the collective welfare achieved. Unfortunately, in many resource-allocation settings, it is computationally hard to maximize welfare while achieving fairness goals. In this work, we consider ex-ante notions of fairness; popular… ▽ More For the fundamental problem of allocating a set of resources among individuals with varied preferences, the quality of an allocation relates to the degree of fairness and the collective welfare achieved. Unfortunately, in many resource-allocation settings, it is computationally hard to maximize welfare while achieving fairness goals. In this work, we consider ex-ante notions of fairness; popular examples include the \emph{randomized round-robin algorithm} and \emph{sortition mechanism}. We propose a general framework to systematically study the \emph{interpolation} between fairness and welfare goals in a multi-criteria setting. We develop two efficient algorithms ($\varepsilon-Mix$ and $Simple-Mix$) that achieve different trade-off guarantees with respect to fairness and welfare. $\varepsilon-Mix$ achieves an optimal multi-criteria approximation with respect to fairness and welfare, while $Simple-Mix$ achieves optimality up to a constant factor with zero computational overhead beyond the underlying \emph{welfare-maximizing mechanism} and the \emph{ex-ante fair mechanism}. Our framework makes no assumptions on either of the two underlying mechanisms, other than that the fair mechanism produces a distribution over the set of all allocations. Indeed, if these mechanisms are themselves approximation algorithms, our framework will retain the approximation factor, guaranteeing sensitivity to the quality of the underlying mechanisms, while being \emph{oblivious} to them. We also give an extensive experimental analysis for the aforementioned ex-ante fair mechanisms on real data sets, confirming our theoretical analysis. △ Less

Submitted 6 February, 2023; originally announced February 2023.

arXiv:2301.11571 [pdf, other]

AdaBoost is not an Optimal Weak to Strong Learner

Authors: Mikael Møller Høgsgaard, Kasper Green Larsen, Martin Ritzert

Abstract: AdaBoost is a classic boosting algorithm for combining multiple inaccurate classifiers produced by a weak learner, to produce a strong learner with arbitrarily high accuracy when given enough training data. Determining the optimal number of samples necessary to obtain a given accuracy of the strong learner, is a basic learning theoretic question. Larsen and Ritzert (NeurIPS'22) recently presented… ▽ More AdaBoost is a classic boosting algorithm for combining multiple inaccurate classifiers produced by a weak learner, to produce a strong learner with arbitrarily high accuracy when given enough training data. Determining the optimal number of samples necessary to obtain a given accuracy of the strong learner, is a basic learning theoretic question. Larsen and Ritzert (NeurIPS'22) recently presented the first provably optimal weak-to-strong learner. However, their algorithm is somewhat complicated and it remains an intriguing question whether the prototypical boosting algorithm AdaBoost also makes optimal use of training samples. In this work, we answer this question in the negative. Concretely, we show that the sample complexity of AdaBoost, and other classic variations thereof, are sub-optimal by at least one logarithmic factor in the desired accuracy of the strong learner. △ Less

Submitted 27 January, 2023; originally announced January 2023.

arXiv:2207.03304 [pdf, ps, other]

Barriers for Faster Dimensionality Reduction

Authors: Ora Nova Fandina, Mikael Møller Høgsgaard, Kasper Green Larsen

Abstract: The Johnson-Lindenstrauss transform allows one to embed a dataset of $n$ points in $\mathbb{R}^d$ into $\mathbb{R}^m,$ while preserving the pairwise distance between any pair of points up to a factor $(1 \pm \varepsilon)$, provided that $m = Ω(\varepsilon^{-2} \lg n)$. The transform has found an overwhelming number of algorithmic applications, allowing to speed up algorithms and reducing memory co… ▽ More The Johnson-Lindenstrauss transform allows one to embed a dataset of $n$ points in $\mathbb{R}^d$ into $\mathbb{R}^m,$ while preserving the pairwise distance between any pair of points up to a factor $(1 \pm \varepsilon)$, provided that $m = Ω(\varepsilon^{-2} \lg n)$. The transform has found an overwhelming number of algorithmic applications, allowing to speed up algorithms and reducing memory consumption at the price of a small loss in accuracy. A central line of research on such transforms, focus on developing fast embedding algorithms, with the classic example being the Fast JL transform by Ailon and Chazelle. All known such algorithms have an embedding time of $Ω(d \lg d)$, but no lower bounds rule out a clean $O(d)$ embedding time. In this work, we establish the first non-trivial lower bounds (of magnitude $Ω(m \lg m)$) for a large class of embedding algorithms, including in particular most known upper bounds. △ Less

Submitted 7 July, 2022; originally announced July 2022.

arXiv:2204.01800 [pdf, ps, other]

The Fast Johnson-Lindenstrauss Transform is Even Faster

Authors: Ora Nova Fandina, Mikael Møller Høgsgaard, Kasper Green Larsen

Abstract: The seminal Fast Johnson-Lindenstrauss (Fast JL) transform by Ailon and Chazelle (SICOMP'09) embeds a set of $n$ points in $d$-dimensional Euclidean space into optimal $k=O(\varepsilon^{-2} \ln n)$ dimensions, while preserving all pairwise distances to within a factor $(1 \pm \varepsilon)$. The Fast JL transform supports computing the embedding of a data point in $O(d \ln d +k \ln^2 n)$ time, wher… ▽ More The seminal Fast Johnson-Lindenstrauss (Fast JL) transform by Ailon and Chazelle (SICOMP'09) embeds a set of $n$ points in $d$-dimensional Euclidean space into optimal $k=O(\varepsilon^{-2} \ln n)$ dimensions, while preserving all pairwise distances to within a factor $(1 \pm \varepsilon)$. The Fast JL transform supports computing the embedding of a data point in $O(d \ln d +k \ln^2 n)$ time, where the $d \ln d$ term comes from multiplication with a $d \times d$ Hadamard matrix and the $k \ln^2 n$ term comes from multiplication with a sparse $k \times d$ matrix. Despite the Fast JL transform being more than a decade old, it is one of the fastest dimensionality reduction techniques for many tradeoffs between $\varepsilon, d$ and $n$. In this work, we give a surprising new analysis of the Fast JL transform, showing that the $k \ln^2 n$ term in the embedding time can be improved to $(k \ln^2 n)/α$ for an $α= Ω(\min\{\varepsilon^{-1}\ln(1/\varepsilon), \ln n\})$. The improvement follows by using an even sparser matrix. We also complement our improved analysis with a lower bound showing that our new analysis is in fact tight. △ Less

Submitted 4 April, 2022; originally announced April 2022.

Showing 1–14 of 14 results for author: Høgsgaard, M M