-
Improved Margin Generalization Bounds for Voting Classifiers
Authors:
Mikael Møller Høgsgaard,
Kasper Green Larsen
Abstract:
In this paper we establish a new margin-based generalization bound for voting classifiers, refining existing results and yielding tighter generalization guarantees for widely used boosting algorithms such as AdaBoost (Freund and Schapire, 1997). Furthermore, the new margin-based generalization bound enables the derivation of an optimal weak-to-strong learner: a Majority-of-3 vote of large-margin classifiers with an expected error matching the theoretical lower bound. This result provides a more natural alternative to the Majority-of-5 algorithm of Høgsgaard et al. (2024), and matches the Majority-of-3 result of Aden-Ali et al. (2024) for the realizable prediction model.
Submitted 3 June, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
Tight Generalization Bounds for Large-Margin Halfspaces
Authors:
Kasper Green Larsen,
Natascha Schalburg
Abstract:
We prove the first generalization bound for large-margin halfspaces that is asymptotically tight in the tradeoff between the margin, the fraction of training points with the given margin, the failure probability and the number of training points.
Submitted 19 February, 2025;
originally announced February 2025.
-
Derandomizing Multi-Distribution Learning
Authors:
Kasper Green Larsen,
Omar Montasser,
Nikita Zhivotovskiy
Abstract:
Multi-distribution or collaborative learning involves learning a single predictor that works well across multiple data distributions, using samples from each during training. Recent research on multi-distribution learning, focusing on binary loss and finite VC dimension classes, has shown near-optimal sample complexity that is achieved with oracle efficient algorithms. That is, these algorithms are computationally efficient given an efficient ERM for the class. Unlike in classical PAC learning, where the optimal sample complexity is achieved with deterministic predictors, current multi-distribution learning algorithms output randomized predictors. This raises the question: can these algorithms be derandomized to produce a deterministic predictor for multiple distributions? Through a reduction to discrepancy minimization, we show that derandomizing multi-distribution learning is computationally hard, even when ERM is computationally efficient. On the positive side, we identify a structural condition enabling an efficient black-box reduction, converting existing randomized multi-distribution predictors into deterministic ones.
Submitted 26 September, 2024;
originally announced September 2024.
-
Revisiting Agnostic PAC Learning
Authors:
Steve Hanneke,
Kasper Green Larsen,
Nikita Zhivotovskiy
Abstract:
PAC learning, dating back to Valiant'84 and Vapnik and Chervonenkis'64,'74, is a classic model for studying supervised learning. In the agnostic setting, we have access to a hypothesis set $\mathcal{H}$ and a training set of labeled samples $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$ drawn i.i.d. from an unknown distribution $\mathcal{D}$. The goal is to produce a classifier $h : \mathcal{X} \to \{-1,1\}$ that is competitive with the hypothesis $h^\star_{\mathcal{D}} \in \mathcal{H}$ having the least probability of mispredicting the label $y$ of a new sample $(x,y)\sim \mathcal{D}$.
Empirical Risk Minimization (ERM) is a natural learning algorithm, where one simply outputs the hypothesis from $\mathcal{H}$ making the fewest mistakes on the training data. This simple algorithm is known to have an optimal error in terms of the VC-dimension of $\mathcal{H}$ and the number of samples $n$.
In this work, we revisit agnostic PAC learning and first show that ERM is in fact sub-optimal if we treat the performance of the best hypothesis, denoted $τ:=\Pr_{\mathcal{D}}[h^\star_{\mathcal{D}}(x) \neq y]$, as a parameter. Concretely we show that ERM, and any other proper learning algorithm, is sub-optimal by a $\sqrt{\ln(1/τ)}$ factor. We then complement this lower bound with the first learning algorithm achieving an optimal error for nearly the full range of $τ$. Our algorithm introduces several new ideas that we hope may find further applications in learning theory.
Submitted 29 July, 2024;
originally announced July 2024.
-
Majority-of-Three: The Simplest Optimal Learner?
Authors:
Ishaq Aden-Ali,
Mikael Møller Høgsgaard,
Kasper Green Larsen,
Nikita Zhivotovskiy
Abstract:
Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke's algorithm is quite complex as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets of the data. It is thus a natural goal to determine the simplest algorithm that is optimal. In this work we study the arguably simplest algorithm that could be optimal: returning the majority vote of three ERM classifiers. We show that this algorithm achieves the optimal in-expectation bound on its error which is provably unattainable by a single ERM classifier. Furthermore, we prove a near-optimal high-probability bound on this algorithm's error. We conjecture that a better analysis will prove that this algorithm is in fact optimal in the high-probability regime.
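The majority-of-three idea can be illustrated with a minimal sketch. The abstract does not say how the three ERM classifiers are trained, so the split into three disjoint, interleaved parts is an assumption here, as is the hypothetical hypothesis class of one-dimensional thresholds; `erm_threshold` and `majority_of_three` are illustrative names, not the authors' code.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]          # (x, label) with label in {-1, +1}
Classifier = Callable[[float], int]


def erm_threshold(samples: Sequence[Sample]) -> Classifier:
    """ERM over the hypothetical class h_t(x) = +1 if x >= t else -1."""
    # Candidate thresholds: every sample point, plus "all +1" and "all -1".
    candidates = [float("-inf")] + sorted(x for x, _ in samples) + [float("inf")]

    def mistakes(t: float) -> int:
        return sum(1 for x, y in samples if (1 if x >= t else -1) != y)

    best = min(candidates, key=mistakes)  # fewest training mistakes
    return lambda x: 1 if x >= best else -1


def majority_of_three(samples: Sequence[Sample]) -> Classifier:
    """Majority vote of three ERM classifiers.

    Assumption: each ERM sees one of three disjoint, interleaved parts
    of the training data. The abstract does not specify the split.
    """
    parts: List[Sequence[Sample]] = [samples[i::3] for i in range(3)]
    voters = [erm_threshold(part) for part in parts]
    # Three votes in {-1, +1} never tie, so the sign of the sum is the majority.
    return lambda x: 1 if sum(h(x) for h in voters) > 0 else -1


# Realizable toy data: the true label is +1 exactly when x >= 0.5.
data = [(i / 30, 1 if i >= 15 else -1) for i in range(30)]
vote = majority_of_three(data)
print(vote(0.9), vote(0.1))
```

With only three voters, an error requires at least two of the ERM classifiers to be wrong on the same point, which is the intuition behind combining them.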
Submitted 12 March, 2024;
originally announced March 2024.
-
Diagonalization Games
Authors:
Noga Alon,
Olivier Bousquet,
Kasper Green Larsen,
Shay Moran,
Shlomo Moran
Abstract:
We study several variants of a combinatorial game which is based on Cantor's diagonal argument.
The game is between two players called Kronecker and Cantor. The names of the players are motivated by the known fact that Leopold Kronecker did not appreciate Georg Cantor's arguments about the infinite, and even referred to him as a "scientific charlatan". In the game Kronecker maintains a list of m binary vectors, each of length n, and Cantor's goal is to produce a new binary vector which is different from each of Kronecker's vectors, or prove that no such vector exists. Cantor does not see Kronecker's vectors but he is allowed to ask queries of the form "What is bit number j of vector number i?" What is the minimal number of queries with which Cantor can achieve his goal? How much better can Cantor do if he is allowed to pick his queries adaptively, based on Kronecker's previous replies? The case when m=n is solved by diagonalization using n (non-adaptive) queries. We study this game more generally, and prove an optimal bound in the adaptive case and nearly tight upper and lower bounds in the non-adaptive case.
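The m=n case mentioned in the abstract can be sketched as follows. The `CountingOracle` class and the function name `cantor_diagonal` are hypothetical names introduced for illustration; the sketch only covers the classical diagonal strategy, not the paper's general bounds.

```python
from typing import List


class CountingOracle:
    """Kronecker's side: holds the hidden vectors and counts queries."""

    def __init__(self, vectors: List[List[int]]) -> None:
        self._vectors = vectors
        self.queries = 0

    def bit(self, i: int, j: int) -> int:
        """Answer 'What is bit number j of vector number i?'."""
        self.queries += 1
        return self._vectors[i][j]


def cantor_diagonal(oracle: CountingOracle, n: int) -> List[int]:
    """Cantor's non-adaptive strategy for m = n vectors of length n.

    Query the diagonal bit (i, i) of every vector and flip it, so the
    output differs from vector i in position i.
    """
    return [1 - oracle.bit(i, i) for i in range(n)]


hidden = [[0, 1, 1], [1, 1, 0], [0, 0, 0]]
oracle = CountingOracle(hidden)
answer = cantor_diagonal(oracle, 3)
print(answer, oracle.queries)
```

The queried positions are fixed in advance, so the strategy is non-adaptive and uses exactly n queries.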
Submitted 22 January, 2023; v1 submitted 5 January, 2023;
originally announced January 2023.
-
Optimal Learning of Joint Alignments with a Faulty Oracle
Authors:
Kasper Green Larsen,
Michael Mitzenmacher,
Charalampos E. Tsourakakis
Abstract:
We consider the following problem, which is useful in applications such as joint image and shape alignment. The goal is to recover $n$ discrete variables $g_i \in \{0, \ldots, k-1\}$ (up to some global offset) given noisy observations of a set of their pairwise differences $\{(g_i - g_j) \bmod k\}$; specifically, with probability $\frac{1}{k}+δ$ for some $δ> 0$ one obtains the correct answer, and with the remaining probability one obtains a uniformly random incorrect answer. We consider a learning-based formulation where one can perform a query to observe a pairwise difference, and the goal is to perform as few queries as possible while obtaining the exact joint alignment. We provide an easy-to-implement, time-efficient algorithm that performs $O\big(\frac{n \lg n}{k δ^2}\big)$ queries, and recovers the joint alignment with high probability. We also show that our algorithm is optimal by proving a general lower bound that holds for all non-adaptive algorithms. Our work significantly improves on recent work by Chen and Candès, who view the problem as a constrained principal components analysis problem that can be solved using the power method. Specifically, our approach is simpler both in the algorithm and the analysis, and provides additional insights into the problem structure.
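The query model described above can be simulated with a short sketch. It covers only the noisy oracle, not the paper's recovery algorithm; the function name `noisy_difference` and its parameters are assumptions made for illustration.

```python
import random
from typing import Sequence


def noisy_difference(g: Sequence[int], i: int, j: int, k: int,
                     delta: float, rng: random.Random) -> int:
    """Simulate one query for (g_i - g_j) mod k under the stated noise model.

    With probability 1/k + delta the true difference is returned;
    otherwise a uniformly random *incorrect* value in {0, ..., k-1}.
    """
    truth = (g[i] - g[j]) % k
    if rng.random() < 1.0 / k + delta:
        return truth
    # Draw uniformly from the k-1 wrong values by skipping over the truth.
    wrong = rng.randrange(k - 1)
    return wrong if wrong < truth else wrong + 1


rng = random.Random(0)
g = [3, 1, 0, 2]
answers = [noisy_difference(g, 0, 1, 4, 0.25, rng) for _ in range(10)]
print(answers)
```

With $δ = 0$ the answer is uniform over all $k$ values and carries no information, which is why the query bound in the abstract grows as $δ$ shrinks.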
Submitted 21 September, 2019;
originally announced September 2019.
-
Fully Understanding the Hashing Trick
Authors:
Casper Benjamin Freksen,
Lior Kamma,
Kasper Green Larsen
Abstract:
Feature hashing, also known as the hashing trick, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix $A : \mathbb{R}^n \to \mathbb{R}^m$ (where $m \ll n$) in order to reduce the dimension of the data from $n$ to $m$ while approximately preserving the Euclidean norm. Every column of $A$ contains exactly one non-zero entry, equal to either $-1$ or $1$.
Weinberger et al. showed tail bounds on $\|Ax\|_2^2$. Specifically they showed that for every $\varepsilon, δ$, if $\|x\|_{\infty} / \|x\|_2$ is sufficiently small, and $m$ is sufficiently large, then $$\Pr[ \; | \;\|Ax\|_2^2 - \|x\|_2^2\; | < \varepsilon \|x\|_2^2 \;] \ge 1 - δ\;.$$ These bounds were later extended by Dasgupta et al. (2010) and most recently refined by Dahlgaard et al. (2017); however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters $\|x\|_{\infty} / \|x\|_2, m, \varepsilon, δ$, remained an open question.
We settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. We complement the asymptotic bound with empirical data, which shows that the constants "hiding" in the asymptotic notation are, in fact, very close to $1$, thus further illustrating the tightness of the presented bounds in practice.
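The matrix $A$ described above can be sketched without ever materializing it: each input coordinate is assigned one output bucket and one sign. This is a minimal illustration under the assumption that buckets and signs are drawn uniformly and independently; `make_feature_hash` and `sq_norm` are hypothetical helper names.

```python
import random
from typing import Callable, List, Sequence


def make_feature_hash(n: int, m: int, seed: int = 0) -> Callable[[Sequence[float]], List[float]]:
    """Return the map x -> Ax for a random sparse A with m rows and n columns.

    Column j of A has exactly one non-zero entry: sign[j] (either -1 or +1)
    in row bucket[j]. Uniform independent choices are an assumption here.
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(m) for _ in range(n)]
    sign = [rng.choice((-1, 1)) for _ in range(n)]

    def apply(x: Sequence[float]) -> List[float]:
        y = [0.0] * m
        for j, xj in enumerate(x):
            y[bucket[j]] += sign[j] * xj  # add coordinate j into its bucket
        return y

    return apply


def sq_norm(v: Sequence[float]) -> float:
    """Squared Euclidean norm."""
    return sum(t * t for t in v)


A = make_feature_hash(n=1000, m=64, seed=1)
x = [1.0 / (j + 1) for j in range(1000)]
print(sq_norm(x), sq_norm(A(x)))
```

Comparing the two printed values for vectors with small and large $\|x\|_{\infty} / \|x\|_2$ gives a feel for the tradeoff the paper analyzes; collisions between heavy coordinates are what distort the norm.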
Submitted 22 May, 2018;
originally announced May 2018.
-
Constructive Discrepancy Minimization with Hereditary L2 Guarantees
Authors:
Kasper Green Larsen
Abstract:
In discrepancy minimization problems, we are given a family of sets $\mathcal{S} = \{S_1,\dots,S_m\}$, with each $S_i \in \mathcal{S}$ a subset of some universe $U = \{u_1,\dots,u_n\}$ of $n$ elements. The goal is to find a coloring $χ: U \to \{-1,+1\}$ of the elements of $U$ such that each set $S \in \mathcal{S}$ is colored as evenly as possible. Two classic measures of discrepancy are $\ell_\infty$-discrepancy defined as $\textrm{disc}_\infty(\mathcal{S},χ):=\max_{S \in \mathcal{S}} | \sum_{u_i \in S} χ(u_i) |$ and $\ell_2$-discrepancy defined as $\textrm{disc}_2(\mathcal{S},χ):=\sqrt{(1/|\mathcal{S}|)\sum_{S \in \mathcal{S}} \left(\sum_{u_i \in S}χ(u_i)\right)^2}$. Breakthrough work by Bansal gave a polynomial time algorithm, based on rounding an SDP, for finding a coloring $χ$ such that $\textrm{disc}_\infty(\mathcal{S},χ) = O(\lg n \cdot \textrm{herdisc}_\infty(\mathcal{S}))$ where $\textrm{herdisc}_\infty(\mathcal{S})$ is the hereditary $\ell_\infty$-discrepancy of $\mathcal{S}$. We complement his work by giving a simple $O((m+n)n^2)$ time algorithm for finding a coloring $χ$ such that $\textrm{disc}_2(\mathcal{S},χ) = O(\sqrt{\lg n} \cdot \textrm{herdisc}_2(\mathcal{S}))$ where $\textrm{herdisc}_2(\mathcal{S})$ is the hereditary $\ell_2$-discrepancy of $\mathcal{S}$. Interestingly, our algorithm avoids solving an SDP and instead relies on computing eigendecompositions of matrices. Moreover, we use our ideas to speed up the Edge-Walk algorithm by Lovett and Meka [SICOMP'15]. To prove that our algorithm has the claimed guarantees, we show new inequalities relating $\textrm{herdisc}_\infty$ and $\textrm{herdisc}_2$ to the eigenvalues of the matrix corresponding to $\mathcal{S}$. Our inequalities improve over previous work by Chazelle and Lvov, and by Matousek et al. Finally, we also implement our algorithm and show that it far outperforms random sampling.
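The two discrepancy measures defined above translate directly into code. This sketch only evaluates a given coloring; it does not implement the paper's eigendecomposition-based algorithm. The names `disc_inf` and `disc_2` and the dictionary representation of $χ$ are choices made for illustration.

```python
import math
from typing import Dict, Hashable, Iterable, Sequence

Coloring = Dict[Hashable, int]  # maps each universe element to -1 or +1


def set_sum(chi: Coloring, s: Iterable[Hashable]) -> int:
    """Signed sum of the colors of the elements of one set."""
    return sum(chi[u] for u in s)


def disc_inf(sets: Sequence[Iterable[Hashable]], chi: Coloring) -> int:
    """l_infinity discrepancy: the worst imbalance over all sets."""
    return max(abs(set_sum(chi, s)) for s in sets)


def disc_2(sets: Sequence[Iterable[Hashable]], chi: Coloring) -> float:
    """l_2 discrepancy: root mean square imbalance over all sets."""
    return math.sqrt(sum(set_sum(chi, s) ** 2 for s in sets) / len(sets))


sets = [{0, 1}, {1, 2}, {0, 1, 2}]
chi = {0: 1, 1: -1, 2: 1}
print(disc_inf(sets, chi), disc_2(sets, chi))
```

Since the $\ell_2$ measure averages over sets while the $\ell_\infty$ measure takes the maximum, `disc_2` is never larger than `disc_inf` for the same coloring.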
Submitted 13 December, 2018; v1 submitted 8 November, 2017;
originally announced November 2017.
-
Predicting Positive and Negative Links with Noisy Queries: Theory & Practice
Authors:
Charalampos E. Tsourakakis,
Michael Mitzenmacher,
Kasper Green Larsen,
Jarosław Błasiok,
Ben Lawson,
Preetum Nakkiran,
Vasileios Nakos
Abstract:
Social networks involve both positive and negative relationships, which can be captured in signed graphs. The edge sign prediction problem aims to predict whether an interaction between a pair of nodes will be positive or negative. We provide theoretical results for this problem that motivate natural improvements to recent heuristics.
The edge sign prediction problem is related to correlation clustering; a positive relationship means being in the same cluster. We consider the following model for two clusters: we are allowed to query any pair of nodes whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability $0<q<\frac{1}{2}$. Let $δ=1-2q$ be the bias. We provide an algorithm that recovers all signs correctly with high probability in the presence of noise with $O(\frac{n\log n}{δ^2}+\frac{\log^2 n}{δ^6})$ queries. This is the best known result for this problem for all but tiny $δ$, improving on the recent work of Mazumdar and Saha. We also provide an algorithm that performs $O(\frac{n\log n}{δ^4})$ queries, and uses breadth-first search as its main algorithmic primitive. While both the running time and the number of queries for this algorithm are sub-optimal, our result relies on novel theoretical techniques, and naturally suggests the use of edge-disjoint paths as a feature for predicting signs in online social networks. Correspondingly, we experiment with using edge-disjoint $s-t$ paths of short length as a feature for predicting the sign of edge $(s,t)$ in real-world signed networks. Empirical findings suggest that the use of such paths improves the classification accuracy, especially for pairs of nodes with no common neighbors.
Submitted 6 December, 2020; v1 submitted 19 September, 2017;
originally announced September 2017.
-
On Using Toeplitz and Circulant Matrices for Johnson-Lindenstrauss Transforms
Authors:
Casper Benjamin Freksen,
Kasper Green Larsen
Abstract:
The Johnson-Lindenstrauss lemma is one of the cornerstone results in dimensionality reduction. It says that given $N$, for any set of $N$ vectors $X \subset \mathbb{R}^n$, there exists a mapping $f : X \to \mathbb{R}^m$ such that $f(X)$ preserves all pairwise distances between vectors in $X$ to within $(1 \pm \varepsilon)$ if $m = O(\varepsilon^{-2} \lg N)$. Much effort has gone into developing fast embedding algorithms, with the Fast Johnson-Lindenstrauss transform of Ailon and Chazelle being one of the most well-known techniques. The current fastest algorithm that yields the optimal $m = O(\varepsilon^{-2}\lg N)$ dimensions has an embedding time of $O(n \lg n + \varepsilon^{-2} \lg^3 N)$. An exciting approach towards improving this, due to Hinrichs and Vybíral, is to use a random $m \times n$ Toeplitz matrix for the embedding. Using the Fast Fourier Transform, the embedding of a vector can then be computed in $O(n \lg m)$ time. The big question is of course whether $m = O(\varepsilon^{-2} \lg N)$ dimensions suffice for this technique. If so, this would end a decades-long quest to obtain faster and faster Johnson-Lindenstrauss transforms. The current best analysis of the embedding of Hinrichs and Vybíral shows that $m = O(\varepsilon^{-2}\lg^2 N)$ dimensions suffice. The main result of this paper is a proof that this analysis unfortunately cannot be tightened any further, i.e., there exists a set of $N$ vectors requiring $m = Ω(\varepsilon^{-2} \lg^2 N)$ for the Toeplitz approach to work.
Submitted 8 November, 2017; v1 submitted 30 June, 2017;
originally announced June 2017.
-
Optimality of the Johnson-Lindenstrauss Lemma
Authors:
Kasper Green Larsen,
Jelani Nelson
Abstract:
For any integers $d, n \geq 2$ and $1/({\min\{n,d\}})^{0.4999} < \varepsilon<1$, we show the existence of a set of $n$ vectors $X\subset \mathbb{R}^d$ such that any embedding $f:X\rightarrow \mathbb{R}^m$ satisfying $$ \forall x,y\in X,\ (1-\varepsilon)\|x-y\|_2^2\le \|f(x)-f(y)\|_2^2 \le (1+\varepsilon)\|x-y\|_2^2 $$ must have $$ m = Ω(\varepsilon^{-2} \lg n). $$ This lower bound matches the upper bound given by the Johnson-Lindenstrauss lemma [JL84]. Furthermore, our lower bound holds for nearly the full range of $\varepsilon$ of interest, since there is always an isometric embedding into dimension $\min\{d, n\}$ (either the identity map, or projection onto $\mathop{span}(X)$).
Previously such a lower bound was only known to hold against linear maps $f$, and not for such a wide range of parameters $\varepsilon, n, d$ [LN16]. The best previously known lower bound for general $f$ was $m = Ω(\varepsilon^{-2}\lg n/\lg(1/\varepsilon))$ [Wel74, Lev83, Alo03], which is suboptimal for any $\varepsilon = o(1)$.
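The distortion condition in the abstract can be checked mechanically for any candidate embedding, linear or not. The sketch below is only such a checker, under the assumption that points are given as tuples of floats; `preserves_distances` is a hypothetical name, and nothing here reproduces the lower-bound construction.

```python
from itertools import combinations
from typing import Callable, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((s - t) ** 2 for s, t in zip(a, b))


def preserves_distances(points: Sequence[Vector],
                        f: Callable[[Vector], Vector],
                        eps: float) -> bool:
    """Check (1-eps)|x-y|^2 <= |f(x)-f(y)|^2 <= (1+eps)|x-y|^2 for all pairs."""
    for x, y in combinations(points, 2):
        original = sq_dist(x, y)
        embedded = sq_dist(f(x), f(y))
        if not (1 - eps) * original <= embedded <= (1 + eps) * original:
            return False
    return True


pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
print(preserves_distances(pts, lambda v: v, 0.1))       # identity map
print(preserves_distances(pts, lambda v: v[:1], 0.5))   # drop a coordinate
```

The identity map always passes, which mirrors the remark in the abstract that an isometric embedding into dimension $\min\{d, n\}$ always exists; dropping a coordinate collapses two of the sample points and fails.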
Submitted 8 November, 2017; v1 submitted 7 September, 2016;
originally announced September 2016.
-
The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction
Authors:
Kasper Green Larsen,
Jelani Nelson
Abstract:
For any $n>1$ and $0<\varepsilon<1/2$, we show the existence of an $n^{O(1)}$-point subset $X$ of $\mathbb{R}^n$ such that any linear map from $(X,\ell_2)$ to $\ell_2^m$ with distortion at most $1+\varepsilon$ must have $m = Ω(\min\{n, \varepsilon^{-2}\log n\})$. Our lower bound matches the upper bounds provided by the identity matrix and the Johnson-Lindenstrauss lemma, improving the previous lower bound of Alon by a $\log(1/\varepsilon)$ factor.
Submitted 10 November, 2014;
originally announced November 2014.