-
Better Neural Network Expressivity: Subdividing the Simplex
Authors:
Egor Bakaev,
Florestan Brunck,
Christoph Hertrich,
Jack Stade,
Amir Yehudayoff
Abstract:
This work studies the expressivity of ReLU neural networks with a focus on their depth. A sequence of previous works showed that $\lceil \log_2(n+1) \rceil$ hidden layers are sufficient to compute all continuous piecewise linear (CPWL) functions on $\mathbb{R}^n$. Hertrich, Basu, Di Summa, and Skutella (NeurIPS'21) conjectured that this result is optimal in the sense that there are CPWL functions on $\mathbb{R}^n$, like the maximum function, that require this depth. We disprove the conjecture and show that $\lceil\log_3(n-1)\rceil+1$ hidden layers are sufficient to compute all CPWL functions on $\mathbb{R}^n$.
A key step in the proof is that ReLU neural networks with two hidden layers can exactly represent the maximum function of five inputs. More generally, we show that $\lceil\log_3(n-2)\rceil+1$ hidden layers are sufficient to compute the maximum of $n\geq 4$ numbers. Our constructions almost match the $\lceil\log_3(n)\rceil$ lower bound of Averkov, Hojny, and Merkert (ICLR'25) in the special case of ReLU networks with weights that are decimal fractions. The constructions have a geometric interpretation via polyhedral subdivisions of the simplex into ``easier'' polytopes.
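For comparison, the earlier $\lceil \log_2 n \rceil$-type depth for the maximum function comes from a pairwise tournament, where each round uses one hidden layer via the identity $\max(a,b) = \mathrm{ReLU}(a-b) + \mathrm{ReLU}(b) - \mathrm{ReLU}(-b)$. The sketch below implements only that classical baseline, not the subdivision-based construction announced in the abstract; the function names are hypothetical and chosen for illustration.

```python
def relu(x):
    return max(x, 0.0)

def max2(a, b):
    # One hidden layer with three ReLU neurons:
    # max(a, b) = relu(a - b) + relu(b) - relu(-b)
    return relu(a - b) + relu(b) - relu(-b)

def tournament_max(values):
    """Return (maximum, number of hidden layers used) for the pairwise tournament."""
    layer = list(values)
    depth = 0
    while len(layer) > 1:
        # Pair up neighbours; each pair costs one max2 gadget in this layer.
        nxt = [max2(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            # An unpaired value is carried through (x = relu(x) - relu(-x)).
            nxt.append(layer[-1])
        layer = nxt
        depth += 1
    return layer[0], depth
```

On five inputs this baseline uses three hidden layers, whereas the abstract states that two hidden layers suffice.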
Submitted 20 May, 2025;
originally announced May 2025.
-
On the Depth of Monotone ReLU Neural Networks and ICNNs
Authors:
Egor Bakaev,
Florestan Brunck,
Christoph Hertrich,
Daniel Reichman,
Amir Yehudayoff
Abstract:
We study two models of ReLU neural networks: monotone networks (ReLU$^+$) and input convex neural networks (ICNN). Our focus is on expressivity, mostly in terms of depth, and we prove the following lower bounds. For the maximum function MAX$_n$ computing the maximum of $n$ real numbers, we show that ReLU$^+$ networks cannot compute MAX$_n$, or even approximate it. We prove a sharp $n$ lower bound on the ICNN depth complexity of MAX$_n$. We also prove depth separations between ReLU networks and ICNNs; for every $k$, there is a depth-2 ReLU network of size $O(k^2)$ that cannot be simulated by a depth-$k$ ICNN. The proofs are based on deep connections between neural networks and polyhedral geometry, and also use isoperimetric properties of triangulations.
Submitted 9 May, 2025;
originally announced May 2025.
-
On the Space Complexity of Online Convolution
Authors:
Joel Daniel Andersson,
Amir Yehudayoff
Abstract:
We study a discrete convolution streaming problem. An input arrives as a stream of numbers $z = (z_0,z_1,z_2,\ldots)$, and at time $t$ our goal is to output $(Tz)_t$ where $T$ is a lower-triangular Toeplitz matrix. We focus on space complexity; the algorithm can store a buffer of $\beta(t)$ numbers in order to achieve this goal.
We characterize space complexity when algorithms perform continuous operations. The matrix $T$ corresponds to a generating function $G(x)$. If $G(x)$ is rational of degree $d$, then it is known that the space complexity is at most $O(d)$. We prove a corresponding lower bound; the space complexity is at least $\Omega(d)$. In addition, for irrational $G(x)$, we prove that the space complexity is infinite. We also provide finite-time guarantees. For example, for the generating function $\frac{1}{\sqrt{1-x}}$ that was studied in various previous works in the context of differentially private continual counting, we prove a sharp lower bound on the space complexity; at time $t$, it is at least $\Omega(t)$.
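To make the streaming setup concrete, here is a minimal sketch of the naive algorithm for $G(x) = \frac{1}{\sqrt{1-x}}$, which simply keeps every input seen so far, so its buffer grows linearly in $t$. It assumes the Toeplitz entries are the Taylor coefficients $c_k = \binom{2k}{k}/4^k$ of $G$, computed by the recurrence $c_k = c_{k-1}\cdot\frac{2k-1}{2k}$. The function name is hypothetical, and this is only a baseline for illustration, not an algorithm from the paper.

```python
def naive_online_convolution(stream):
    """Yield (Tz)_t for G(x) = 1/sqrt(1-x), storing every input seen so far."""
    buffer = []  # all past inputs z_0, ..., z_t: t+1 numbers at time t
    coeffs = []  # Taylor coefficients c_0, ..., c_t of 1/sqrt(1-x)
    for z in stream:
        k = len(coeffs)
        # c_0 = 1 and c_k = c_{k-1} * (2k - 1) / (2k)
        coeffs.append(1.0 if k == 0 else coeffs[-1] * (2 * k - 1) / (2 * k))
        buffer.append(z)
        t = len(buffer) - 1
        # (Tz)_t = sum_{j=0}^{t} c_j * z_{t-j}
        yield sum(coeffs[j] * buffer[t - j] for j in range(t + 1))
```

The coefficients are kept in a list here only for convenience; the quantity the abstract bounds is the number of stored values that depend on the input.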
Submitted 13 May, 2025; v1 submitted 30 April, 2025;
originally announced May 2025.
-
Data Selection for ERMs
Authors:
Steve Hanneke,
Shay Moran,
Alexander Shlimovich,
Amir Yehudayoff
Abstract:
Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population.
We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.
Submitted 25 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
The Algebraic Cost of a Boolean Sum
Authors:
Ian Orzel,
Srikanth Srinivasan,
Sébastien Tavenas,
Amir Yehudayoff
Abstract:
The P versus NP problem is about the computational power of an existential $\exists_{w \in \{0,1\}^n}$ quantifier. The VP versus VNP problem is about the power of a boolean sum $\sum_{w \in \{0,1\}^n}$ operation. We study the power of a single boolean sum $\sum_{w \in \{0,1\}}$, and prove that in some cases the cost of eliminating this sum is large. This identifies a fundamental difference between the permanent and the determinant. This investigation also leads to the simplest proof we are aware of that the permanent is VNP-complete.
Submitted 4 February, 2025;
originally announced February 2025.
-
Intuitive norms are Euclidean
Authors:
Shay Moran,
Alexander Shlimovich,
Amir Yehudayoff
Abstract:
We call a norm on $\mathbb{R}^n$ intuitive if for all points $p_1,\ldots,p_m$ in $\mathbb{R}^n$, one of the geometric medians of the points over the norm is in their convex hull. We characterize all intuitive norms.
Submitted 7 January, 2025; v1 submitted 5 January, 2025;
originally announced January 2025.
-
A Blaschke-Santaló inequality for unconditional log-concave measures
Authors:
Emanuel Milman,
Amir Yehudayoff
Abstract:
The Blaschke-Santaló inequality states that the volume product $|K| \cdot |K^o|$ of a symmetric convex body $K \subset \mathbb{R}^n$ is maximized by the standard Euclidean unit-ball. Cordero-Erausquin asked whether the inequality remains true for all even log-concave measures. We verify that the inequality is true for all unconditional log-concave measures.
Submitted 28 October, 2024;
originally announced October 2024.
-
Fixed and Periodic Points of the Intersection Body Operator
Authors:
Emanuel Milman,
Shahar Shabelman,
Amir Yehudayoff
Abstract:
The intersection body $IK$ of a star-body $K$ in $\mathbb{R}^n$ was introduced by E. Lutwak following the work of H. Busemann, and plays a central role in the dual Brunn-Minkowski theory. We show that when $n \geq 3$, $I^2 K = c K$ iff $K$ is a centered ellipsoid, and hence $I K = c K$ iff $K$ is a centered Euclidean ball, answering long-standing questions by Lutwak, Gardner, and Fish-Nazarov-Ryabogin-Zvavitch. To this end, we recast the iterated intersection body equation as an Euler-Lagrange equation for a certain volume functional under radial perturbations, derive new formulas for the volume of $I K$, and introduce a continuous version of Steiner symmetrization for Lipschitz star-bodies, which (surprisingly) yields a useful radial perturbation exactly when $n\geq 3$.
Submitted 10 June, 2025; v1 submitted 15 August, 2024;
originally announced August 2024.
-
Dual VC Dimension Obstructs Sample Compression by Embeddings
Authors:
Zachary Chase,
Bogdan Chornomaz,
Steve Hanneke,
Shay Moran,
Amir Yehudayoff
Abstract:
This work studies embedding of arbitrary VC classes in well-behaved VC classes, focusing particularly on extremal classes. Our main result expresses an impossibility: such embeddings necessarily require a significant increase in dimension. In particular, we prove that for every $d$ there is a class with VC dimension $d$ that cannot be embedded in any extremal class of VC dimension smaller than exponential in $d$.
In addition to its independent interest, this result has an important implication in learning theory, as it reveals a fundamental limitation of one of the most extensively studied approaches to tackling the long-standing sample compression conjecture. Concretely, the approach proposed by Floyd and Warmuth entails embedding any given VC class into an extremal class of a comparable dimension, and then applying an optimal sample compression scheme for extremal classes. However, our results imply that this strategy would in some cases result in a sample compression scheme at least exponentially larger than what is predicted by the sample compression conjecture.
The above implications follow from a general result we prove: any extremal class with VC dimension $d$ has dual VC dimension at most $2d+1$. This bound is exponentially smaller than the classical bound $2^{d+1}-1$ of Assouad, which applies to general concept classes (and is known to be unimprovable for some classes). We in fact prove a stronger result, establishing that $2d+1$ upper bounds the dual Radon number of extremal classes. This theorem represents an abstraction of the classical Radon theorem for convex sets, extending its applicability to a wider combinatorial framework, without relying on the specifics of Euclidean convexity. The proof utilizes the topological method and is primarily based on variants of the Topological Radon Theorem.
Submitted 27 May, 2024;
originally announced May 2024.
-
The Sample Complexity Of ERMs In Stochastic Convex Optimization
Authors:
Daniel Carmon,
Roi Livni,
Amir Yehudayoff
Abstract:
Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that $\Omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are necessary (where $d$ is the dimension and $\epsilon>0$ is the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of $\textit{linear}$ functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.
Submitted 9 November, 2023;
originally announced November 2023.
-
Local Borsuk-Ulam, Stability, and Replicability
Authors:
Zachary Chase,
Bogdan Chornomaz,
Shay Moran,
Amir Yehudayoff
Abstract:
We use and adapt the Borsuk-Ulam Theorem from topology to derive limitations on list-replicable and globally stable learning algorithms. We further demonstrate the applicability of our methods in combinatorics and topology.
We show that, besides trivial cases, both list-replicable and globally stable learning are impossible in the agnostic PAC setting. This is in contrast with the realizable case where it is known that any class with a finite Littlestone dimension can be learned by such algorithms. In the realizable PAC setting, we sharpen previous impossibility results and broaden their scope. Specifically, we establish optimal bounds for list replicability and global stability numbers in finite classes. This provides an exponential improvement over previous works and implies an exponential separation from the Littlestone dimension. We further introduce lower bounds for weak learners, i.e., learners that are only marginally better than random guessing. Lower bounds from previous works apply only to stronger learners.
To offer a broader and more comprehensive view of our topological approach, we prove a local variant of the Borsuk-Ulam theorem in topology and a result in combinatorics concerning Kneser colorings. In combinatorics, we prove that if $c$ is a coloring of all non-empty subsets of $[n]$ such that disjoint sets have different colors, then there is a chain of subsets that receives at least $1+ \lfloor n/2\rfloor$ colors (this bound is sharp). In topology, we prove e.g. that for any open antipodal-free cover of the $d$-dimensional sphere, there is a point $x$ that belongs to at least $t=\lceil\frac{d+3}{2}\rceil$ sets.
Submitted 2 November, 2023;
originally announced November 2023.
-
The discrepancy of greater-than
Authors:
Srikanth Srinivasan,
Amir Yehudayoff
Abstract:
The discrepancy of the $n \times n$ greater-than matrix is shown to be $\frac{\pi}{2 \ln n}$ up to lower order terms.
Submitted 15 September, 2023;
originally announced September 2023.
-
Dual Systolic Graphs
Authors:
Daniel Carmon,
Amir Yehudayoff
Abstract:
We define a family of graphs we call dual systolic graphs. This definition comes from graphs that are duals of systolic simplicial complexes. Our main result is a sharp (up to constants) isoperimetric inequality for dual systolic graphs. The first step in the proof is an extension of the classical isoperimetric inequality of the boolean cube. The isoperimetric inequality for dual systolic graphs, however, is exponentially stronger than the one for the boolean cube. Interestingly, we know that dual systolic graphs exist, but we do not yet know how to efficiently construct them. We, therefore, define a weaker notion of dual systolicity. We prove the same isoperimetric inequality for weakly dual systolic graphs, and at the same time provide an efficient construction of a family of graphs that are weakly dual systolic. We call this family of graphs clique products. We show that there is a non-trivial connection between the small set expansion capabilities and the threshold rank of clique products, and believe they can find further applications.
Submitted 17 April, 2023; v1 submitted 11 April, 2023;
originally announced April 2023.
-
A Unified Characterization of Private Learnability via Graph Theory
Authors:
Noga Alon,
Shay Moran,
Hilla Schefler,
Amir Yehudayoff
Abstract:
We provide a unified framework for characterizing pure and approximate differentially private (DP) learnability. The framework uses the language of graph theory: for a concept class $\mathcal{H}$, we define the contradiction graph $G$ of $\mathcal{H}$. Its vertices are realizable datasets, and two datasets $S,S'$ are connected by an edge if they contradict each other (i.e., there is a point $x$ that is labeled differently in $S$ and $S'$). Our main finding is that the combinatorial structure of $G$ is deeply related to learning $\mathcal{H}$ under DP. Learning $\mathcal{H}$ under pure DP is captured by the fractional clique number of $G$. Learning $\mathcal{H}$ under approximate DP is captured by the clique number of $G$. Consequently, we identify graph-theoretic dimensions that characterize DP learnability: the clique dimension and fractional clique dimension. Along the way, we reveal properties of the contradiction graph which may be of independent interest. We also suggest several open questions and directions for future research.
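The contradiction graph can be enumerated by brute force for a tiny finite class. The sketch below assumes, for illustration only, that a dataset is a set of $m$ labeled examples on distinct points and that hypotheses are given as dictionaries from points to labels; the paper's exact conventions may differ, and all function names are hypothetical.

```python
from itertools import combinations, product

def realizable_datasets(hypotheses, domain, m):
    """All size-m datasets (sets of (x, label) pairs) consistent with some hypothesis."""
    datasets = set()
    for h in hypotheses:
        for points in combinations(domain, m):
            datasets.add(frozenset((x, h[x]) for x in points))
    return datasets

def contradict(s1, s2):
    """True if some point x is labeled differently in s1 and s2."""
    labels1 = dict(s1)
    return any(x in labels1 and labels1[x] != y for x, y in s2)

def contradiction_graph(hypotheses, domain, m):
    """Return (vertices, edges) of the contradiction graph on size-m datasets."""
    vertices = sorted(realizable_datasets(hypotheses, domain, m), key=sorted)
    edges = [(a, b) for a, b in combinations(vertices, 2) if contradict(a, b)]
    return vertices, edges

# Example class: all four binary functions on the two-point domain {0, 1}.
domain = [0, 1]
all_functions = [dict(zip(domain, labels)) for labels in product([0, 1], repeat=2)]
```

For this example class with $m = 2$, every pair of distinct datasets contradicts, so the graph is a clique on four vertices.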
Submitted 12 June, 2024; v1 submitted 8 April, 2023;
originally announced April 2023.
-
Replicability and stability in learning
Authors:
Zachary Chase,
Shay Moran,
Amir Yehudayoff
Abstract:
Replicability is essential in science as it allows us to validate and verify research findings. Impagliazzo, Lei, Pitassi and Sorrell ('22) recently initiated the study of replicability in machine learning. A learning algorithm is replicable if it typically produces the same output when applied on two i.i.d. inputs using the same internal randomness. We study a variant of replicability that does not involve fixing the randomness. An algorithm satisfies this form of replicability if it typically produces the same output when applied on two i.i.d. inputs (without fixing the internal randomness). This variant is called global stability and was introduced by Bun, Livni and Moran ('20) in the context of differential privacy.
Impagliazzo et al. showed how to boost any replicable algorithm so that it produces the same output with probability arbitrarily close to 1. In contrast, we demonstrate that for numerous learning tasks, global stability can only be accomplished weakly, where the same output is produced only with probability bounded away from 1. To overcome this limitation, we introduce the concept of list replicability, which is equivalent to global stability. Moreover, we prove that list replicability can be boosted so that it is achieved with probability arbitrarily close to 1. We also describe basic relations between standard learning-theoretic complexity measures and list replicable numbers. Our results, in addition, imply that besides trivial cases, replicable algorithms (in the sense of Impagliazzo et al.) must be randomized.
The proof of the impossibility result is based on a topological fixed-point theorem. For every algorithm, we are able to locate a "hard input distribution" by applying the Poincaré-Miranda theorem in a related topological setting. The equivalence between global stability and list replicability is algorithmic.
Submitted 12 April, 2023; v1 submitted 7 April, 2023;
originally announced April 2023.
-
Random walks on regular trees can not be slowed down
Authors:
Omer Angel,
Jacob Richey,
Yinon Spinka,
Amir Yehudayoff
Abstract:
A random walk on a regular tree (or any non-amenable graph) has positive speed. We ask whether such a walk can be slowed down by applying carefully chosen time-dependent permutations of the vertices. We prove that on trees the random walk can not be slowed down.
Submitted 1 February, 2023;
originally announced February 2023.
-
A Characterization of Multiclass Learnability
Authors:
Nataly Brukhim,
Daniel Carmon,
Irit Dinur,
Shay Moran,
Amir Yehudayoff
Abstract:
A seminal result in learning theory characterizes the PAC learnability of binary classes through the Vapnik-Chervonenkis dimension. Extending this characterization to the general multiclass setting has been open since the pioneering works on multiclass PAC learning in the late 1980s. This work resolves this problem: we characterize multiclass PAC learnability through the DS dimension, a combinatorial dimension defined by Daniely and Shalev-Shwartz (2014).
The classical characterization of the binary case boils down to empirical risk minimization. In contrast, our characterization of the multiclass case involves a variety of algorithmic ideas; these include a natural setting we call list PAC learning. In the list learning setting, instead of predicting a single outcome for a given unseen input, the goal is to provide a short menu of predictions.
Our second main result concerns the Natarajan dimension, which has been a central candidate for characterizing multiclass learnability. This dimension was introduced by Natarajan (1988) as a barrier for PAC learning. Whether the Natarajan dimension characterizes PAC learnability in general has been posed as an open question in several papers since. This work provides a negative answer: we construct a non-learnable class with Natarajan dimension one.
For the construction, we identify a fundamental connection between concept classes and topology (i.e., colorful simplicial complexes). We crucially rely on a deep and involved construction of hyperbolic pseudo-manifolds by Januszkiewicz and Swiatkowski. It is interesting that hyperbolicity is directly related to learning problems that are difficult to solve although no obvious barriers exist. This is another demonstration of the fruitful links machine learning has with different areas in mathematics.
Submitted 3 March, 2022;
originally announced March 2022.
-
Anti-concentration and the Exact Gap-Hamming Problem
Authors:
Anup Rao,
Amir Yehudayoff
Abstract:
We prove anti-concentration bounds for the inner product of two independent random vectors, and use these bounds to prove lower bounds in communication complexity. We show that if $A,B$ are subsets of the cube $\{\pm 1\}^n$ with $|A| \cdot |B| \geq 2^{1.01 n}$, and $X \in A$ and $Y \in B$ are sampled independently and uniformly, then the inner product $\langle X,Y \rangle$ takes on any fixed value with probability at most $O(1/\sqrt{n})$. In fact, we prove the following stronger "smoothness" statement: $$ \max_{k } \big| \Pr[\langle X,Y \rangle = k] - \Pr[\langle X,Y \rangle = k+4]\big| \leq O(1/n).$$ We use these results to prove that the exact gap-hamming problem requires linear communication, resolving an open problem in communication complexity. We also conclude anti-concentration for structured distributions with low entropy. If $x \in \mathbb{Z}^n$ has no zero coordinates, and $B \subseteq \{\pm 1\}^n$ corresponds to a subspace of $\mathbb{F}_2^n$ of dimension $0.51n$, then $\max_k \Pr[\langle x,Y \rangle = k] \leq O(\sqrt{\ln (n)/n})$.
Submitted 4 January, 2022;
originally announced January 2022.
-
Regularization by Misclassification in ReLU Neural Networks
Authors:
Elisabetta Cornacchia,
Jan Hązła,
Ido Nachum,
Amir Yehudayoff
Abstract:
We study the implicit bias of ReLU neural networks trained by a variant of SGD where at each step, the label is changed with probability $p$ to a random label (label smoothing being a close variant of this procedure). Our experiments demonstrate that label noise propels the network to a sparse solution in the following sense: for a typical input, a small fraction of neurons are active, and the firing pattern of the hidden layers is sparser. In fact, for some instances, an appropriate amount of label noise does not only sparsify the network but further reduces the test error. We then turn to the theoretical analysis of such sparsification mechanisms, focusing on the extremal case of $p=1$. We show that in this case, the network withers as anticipated from experiments, but surprisingly, in different ways that depend on the learning rate and the presence of bias, with either weights vanishing or neurons ceasing to fire.
Submitted 3 November, 2021;
originally announced November 2021.
-
Tight bounds on the Fourier growth of bounded functions on the hypercube
Authors:
Siddharth Iyer,
Anup Rao,
Victor Reis,
Thomas Rothvoss,
Amir Yehudayoff
Abstract:
We give tight bounds on the degree $\ell$ homogeneous parts $f_\ell$ of a bounded function $f$ on the cube. We show that if $f: \{\pm 1\}^n \rightarrow [-1,1]$ has degree $d$, then $\| f_\ell \|_\infty$ is bounded by $d^\ell/\ell!$, and $\| \hat{f}_\ell \|_1$ is bounded by $d^\ell e^{\binom{\ell+1}{2}} n^{\frac{\ell-1}{2}}$. We describe applications to pseudorandomness and learning theory. We use similar methods to generalize the classical Pisier's inequality from convex analysis. Our analysis involves properties of real-rooted polynomials that may be useful elsewhere.
Submitted 19 July, 2021; v1 submitted 13 July, 2021;
originally announced July 2021.
-
A lower bound for essential covers of the cube
Authors:
Gal Yehuda,
Amir Yehudayoff
Abstract:
Essential covers were introduced by Linial and Radhakrishnan as a model that captures two complementary properties: (1) all variables must be included and (2) no element is redundant. In their seminal paper, they proved that every essential cover of the $n$-dimensional hypercube must be of size at least $\Omega(n^{0.5})$. Later on, this notion found several applications in complexity theory. We improve the lower bound to $\Omega(n^{0.52})$, and describe two applications.
Submitted 28 May, 2021;
originally announced May 2021.
-
On the Communication Complexity of Key-Agreement Protocols
Authors:
Iftach Haitner,
Noam Mazor,
Rotem Oshman,
Omer Reingold,
Amir Yehudayoff
Abstract:
Key-agreement protocols whose security is proven in the random oracle model are an important alternative to protocols based on public-key cryptography. In the random oracle model, the parties and the eavesdropper have access to a shared random function (an "oracle"), but the parties are limited in the number of queries they can make to the oracle. The random oracle serves as an abstraction for black-box access to a symmetric cryptographic primitive, such as a collision resistant hash. Unfortunately, as shown by Impagliazzo and Rudich [STOC '89] and Barak and Mahmoody [Crypto '09], such protocols can only guarantee limited secrecy: the key of any $\ell$-query protocol can be revealed by an $O(\ell^2)$-query adversary. This quadratic gap between the query complexity of the honest parties and the eavesdropper matches the gap obtained by the Merkle's Puzzles protocol of Merkle [CACM '78].
In this work we tackle a new aspect of key-agreement protocols in the random oracle model: their communication complexity. In Merkle's Puzzles, to obtain secrecy against an eavesdropper that makes roughly $\ell^2$ queries, the honest parties need to exchange $Ω(\ell)$ bits. We show that for protocols with certain natural properties, ones that Merkle's Puzzles has, such high communication is unavoidable. Specifically, this is the case if the honest parties' queries are uniformly random, or alternatively if the protocol uses non-adaptive queries and has only two rounds. Our proof for the first setting uses a novel reduction from the set-disjointness problem in two-party communication complexity. For the second setting we prove the lower bound directly, using information-theoretic arguments.
Submitted 6 May, 2021; v1 submitted 5 May, 2021;
originally announced May 2021.
-
Slicing the hypercube is not easy
Authors:
Gal Yehuda,
Amir Yehudayoff
Abstract:
We prove that at least $Ω(n^{0.51})$ hyperplanes are needed to slice all edges of the $n$-dimensional hypercube. We provide a couple of applications: lower bounds on the computational complexity of parity, and a lower bound on the cover number of the hypercube by skew hyperplanes.
Submitted 17 February, 2021; v1 submitted 10 February, 2021;
originally announced February 2021.
-
A Theory of Universal Learning
Authors:
Olivier Bousquet,
Steve Hanneke,
Shay Moran,
Ramon van Handel,
Amir Yehudayoff
Abstract:
How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy.
In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay at either an exponential, a linear, or an arbitrarily slow rate. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case.
For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.
Submitted 9 November, 2020;
originally announced November 2020.
-
An Elementary Exposition of Pisier's Inequality
Authors:
Siddharth Iyer,
Anup Rao,
Victor Reis,
Thomas Rothvoss,
Amir Yehudayoff
Abstract:
Pisier's inequality is central in the study of normed spaces and has important applications in geometry. We provide an elementary proof of this inequality, which avoids some non-constructive steps from previous proofs. Our goal is to make the inequality and its proof more accessible, because we think they will find additional applications. We demonstrate this with a new type of restriction on the Fourier spectrum of bounded functions on the discrete cube.
Submitted 22 September, 2020;
originally announced September 2020.
-
Sharp Isoperimetric Inequalities for Affine Quermassintegrals
Authors:
Emanuel Milman,
Amir Yehudayoff
Abstract:
The affine quermassintegrals associated to a convex body in $\mathbb{R}^n$ are affine-invariant analogues of the classical intrinsic volumes from the Brunn-Minkowski theory, and thus constitute a central pillar of affine convex geometry. They were introduced in the 1980's by E. Lutwak, who conjectured that among all convex bodies of a given volume, the $k$-th affine quermassintegral is minimized precisely on the family of ellipsoids. The known cases $k=1$ and $k=n-1$ correspond to the classical Blaschke-Santaló and Petty projection inequalities, respectively. In this work we confirm Lutwak's conjecture, including characterization of the equality cases, for all values of $k=1,\ldots,n-1$, in a single unified framework. In fact, it turns out that ellipsoids are the only local minimizers with respect to the Hausdorff topology.
For the proof, we introduce a number of new ingredients, including a novel construction of the Projection Rolodex of a convex body. In particular, from this new viewpoint, Petty's inequality is interpreted as an integrated form of a generalized Blaschke-Santaló inequality for a new family of polar bodies encoded by the Projection Rolodex. We extend these results to more general $L^p$-moment quermassintegrals, and interpret the case $p=0$ as a sharp averaged Loomis-Whitney isoperimetric inequality.
Submitted 19 August, 2022; v1 submitted 10 May, 2020;
originally announced May 2020.
-
On Symmetry and Initialization for Neural Networks
Authors:
Ido Nachum,
Amir Yehudayoff
Abstract:
This work provides an additional step in the theoretical understanding of neural networks. We consider neural networks with one hidden layer and show that when learning symmetric functions, one can choose initial conditions so that standard SGD training efficiently produces generalization guarantees. We empirically verify this and show that this does not hold when the initial conditions are chosen at random. The proof of convergence investigates the interaction between the two layers of the network. Our results highlight the importance of using symmetry in the design of neural networks.
Submitted 1 July, 2019;
originally announced July 2019.
-
Average-Case Information Complexity of Learning
Authors:
Ido Nachum,
Amir Yehudayoff
Abstract:
How many bits of information are revealed by a learning algorithm for a concept class of VC-dimension $d$? Previous works have shown that even for $d=1$ the amount of information may be unbounded (tend to $\infty$ with the universe size). Can it be that all concepts in the class require leaking a large amount of information? We show that typically concepts do not require leakage. There exists a proper learning algorithm that reveals $O(d)$ bits of information for most concepts in the class. This result is a special case of a more general phenomenon we explore. If there is a low information learner when the algorithm {\em knows} the underlying distribution on inputs, then there is a learner that reveals little information on an average concept {\em without knowing} the distribution on inputs.
Submitted 24 November, 2018;
originally announced November 2018.
-
Anti-concentration in most directions
Authors:
Anup Rao,
Amir Yehudayoff
Abstract:
We prove anti-concentration bounds for the inner product of two independent random vectors. For example, we show that if $A,B$ are subsets of the cube $\{\pm 1\}^n$ with $|A| \cdot |B| \geq 2^{1.01 n}$, and $X \in A$ and $Y \in B$ are sampled independently and uniformly, then the inner product $\langle X, Y \rangle$ takes on any fixed value with probability at most $O(\tfrac{1}{\sqrt{n}})$. Extending Halász's work, we prove stronger bounds when the choices for $x$ are unstructured. We also describe applications to communication complexity, randomness extraction and additive combinatorics.
Submitted 4 March, 2019; v1 submitted 15 November, 2018;
originally announced November 2018.
-
An isoperimetric inequality for Hamming balls and local expansion in hypercubes
Authors:
Zilin Jiang,
Amir Yehudayoff
Abstract:
We prove a vertex isoperimetric inequality for the $n$-dimensional Hamming ball $\mathcal{B}_n(R)$ of radius $R$. The isoperimetric inequality is sharp up to a constant factor for sets that are comparable to $\mathcal{B}_n(R)$ in size. A key step in the proof is a local expansion phenomenon in hypercubes.
Submitted 8 February, 2022; v1 submitted 2 July, 2018;
originally announced July 2018.
-
On the Perceptron's Compression
Authors:
Shay Moran,
Ido Nachum,
Itai Panasoff,
Amir Yehudayoff
Abstract:
We study and provide exposition to several phenomena that are related to the perceptron's compression. One theme concerns modifications of the perceptron algorithm that yield better guarantees on the margin of the hyperplane it outputs. These modifications can be useful in training neural networks as well, and we demonstrate them with some experimental data. In a second theme, we deduce conclusions from the perceptron's compression in various contexts.
Submitted 14 June, 2018;
originally announced June 2018.
-
On the Covariance-Hessian Relation in Evolution Strategies
Authors:
Ofer M. Shir,
Amir Yehudayoff
Abstract:
We consider Evolution Strategies operating only with isotropic Gaussian mutations on positive quadratic objective functions, and investigate the covariance matrix when constructed out of selected individuals by truncation. We prove that the covariance matrix over $(1,λ)$-selected decision vectors becomes proportional to the inverse of the landscape Hessian as the population-size $λ$ increases. This generalizes a previous result that proved an equivalent phenomenon when sampling was assumed to take place in the vicinity of the optimum. It further confirms the classical hypothesis that statistical learning of the landscape is an inherent characteristic of standard Evolution Strategies, and that this distinguishing capability stems only from the usage of isotropic Gaussian mutations and rank-based selection. We provide broad numerical validation for the proven results, and present empirical evidence for its generalization to $(μ,λ)$-selection.
Submitted 27 October, 2019; v1 submitted 10 June, 2018;
originally announced June 2018.
-
A Direct Sum Result for the Information Complexity of Learning
Authors:
Ido Nachum,
Jonathan Shafer,
Amir Yehudayoff
Abstract:
How many bits of information are required to PAC learn a class of hypotheses of VC dimension $d$? The mathematical setting we follow is that of Bassily et al. (2018), where the value of interest is the mutual information $\mathrm{I}(S;A(S))$ between the input sample $S$ and the hypothesis outputted by the learning algorithm $A$. We introduce a class of functions of VC dimension $d$ over the domain $\mathcal{X}$ with information complexity at least $Ω\left(d\log \log \frac{|\mathcal{X}|}{d}\right)$ bits for any consistent and proper algorithm (deterministic or random). Bassily et al. proved a similar (but quantitatively weaker) result for the case $d=1$.
The above result is in fact a special case of a more general phenomenon we explore. We define the notion of information complexity of a given class of functions $\mathcal{H}$. Intuitively, it is the minimum amount of information that an algorithm for $\mathcal{H}$ must retain about its input to ensure consistency and properness. We prove a direct sum result for information complexity in this context; roughly speaking, the information complexity sums when combining several classes.
Submitted 15 April, 2018;
originally announced April 2018.
-
On Communication Complexity of Classification Problems
Authors:
Daniel M. Kane,
Roi Livni,
Shay Moran,
Amir Yehudayoff
Abstract:
This work studies distributed learning in the spirit of Yao's model of communication complexity: consider a two-party setting, where each of the players gets a list of labelled examples and they communicate in order to jointly perform some learning task. To naturally fit into the framework of learning theory, the players can send each other examples (as well as bits) where each example/bit costs one unit of communication. This enables a uniform treatment of infinite classes such as half-spaces in $\mathbb{R}^d$, which are ubiquitous in machine learning.
We study several fundamental questions in this model. For example, we provide combinatorial characterizations of the classes that can be learned with efficient communication in the proper case as well as in the improper case. These findings imply unconditional separations between various learning contexts, e.g. realizable versus agnostic learning, proper versus improper learning, etc.
The derivation of these results hinges on a type of decision problem we term "{\it realizability problems}", where the goal is deciding whether a distributed input sample is consistent with a hypothesis from a pre-specified class.
From a technical perspective, the protocols we use are based on ideas from machine learning theory and the impossibility results are based on ideas from communication complexity theory.
Submitted 23 April, 2018; v1 submitted 15 November, 2017;
originally announced November 2017.
-
A learning problem that is independent of the set theory ZFC axioms
Authors:
Shai Ben-David,
Pavel Hrubes,
Shay Moran,
Amir Shpilka,
Amir Yehudayoff
Abstract:
We consider the following statistical estimation problem: given a family F of real valued functions over some domain X and an i.i.d. sample drawn from an unknown distribution P over X, find h in F such that the expectation of h w.r.t. P is probably approximately equal to the supremum over expectations on members of F. This Expectation Maximization (EMX) problem captures many well studied learning problems; in fact, it is equivalent to Vapnik's general setting of learning.
Surprisingly, we show that the EMX learnability, as well as the learning rates of some basic class F, depend on the cardinality of the continuum and are therefore independent of the set theory ZFC axioms (that are widely accepted as a formalization of the notion of a mathematical proof).
We focus on the case where the functions in F are Boolean, which generalizes classification problems. We study the interaction between the statistical sample complexity of F and its combinatorial structure. We introduce a new version of sample compression schemes and show that it characterizes EMX learnability for a wide family of classes. However, we show that for the class of finite subsets of the real line, the existence of such compression schemes is independent of set theory. We conclude that the learnability of that class with respect to the family of probability distributions of countable support is independent of the set theory ZFC axioms.
We also explore the existence of a "VC-dimension-like" parameter that captures learnability in this setting. Our results imply that there exists no "finitary" combinatorial parameter that characterizes EMX learnability in a way similar to the VC-dimension based characterization of binary valued classification problems.
Submitted 14 November, 2017;
originally announced November 2017.
-
Learners that Use Little Information
Authors:
Raef Bassily,
Shay Moran,
Ido Nachum,
Jonathan Shafer,
Amir Yehudayoff
Abstract:
We study learning algorithms that are restricted to using a small amount of information from their input sample. We introduce a category of learning algorithms we term $d$-bit information learners, which are algorithms whose output conveys at most $d$ bits of information of their input. A central theme in this work is that such algorithms generalize.
We focus on the learning capacity of these algorithms, and prove sample complexity bounds with tight dependencies on the confidence and error parameters. We also observe connections with well studied notions such as sample compression schemes, Occam's razor, PAC-Bayes and differential privacy.
We discuss an approach that allows us to prove upper bounds on the amount of information that algorithms reveal about their inputs, and also provide a lower bound by showing a simple concept class for which every (possibly randomized) empirical risk minimizer must reveal a lot of information. On the other hand, we show that in the distribution-dependent setting every VC class has empirical risk minimizers that do not reveal a lot of information.
Submitted 27 February, 2018; v1 submitted 14 October, 2017;
originally announced October 2017.
-
On weak $ε$-nets and the Radon number
Authors:
Shay Moran,
Amir Yehudayoff
Abstract:
We show that the Radon number characterizes the existence of weak nets in separable convexity spaces (an abstraction of the Euclidean notion of convexity). The construction of weak nets when the Radon number is finite is based on Helly's property and on metric properties of VC classes. The lower bound on the size of weak nets when the Radon number is large relies on the chromatic number of the Kneser graph. As an application, we prove a boosting-type result for weak $ε$-nets.
Submitted 27 February, 2019; v1 submitted 17 July, 2017;
originally announced July 2017.
-
Submultiplicative Glivenko-Cantelli and Uniform Convergence of Revenues
Authors:
Noga Alon,
Moshe Babaioff,
Yannai A. Gonczarowski,
Yishay Mansour,
Shay Moran,
Amir Yehudayoff
Abstract:
In this work we derive a variant of the classic Glivenko-Cantelli Theorem, which asserts uniform convergence of the empirical Cumulative Distribution Function (CDF) to the CDF of the underlying distribution. Our variant allows for tighter convergence bounds for extreme values of the CDF.
We apply our bound in the context of revenue learning, which is a well-studied problem in economics and algorithmic game theory. We derive sample-complexity bounds on the uniform convergence rate of the empirical revenues to the true revenues, assuming a bound on the $k$th moment of the valuations, for any (possibly fractional) $k>1$.
For uniform convergence in the limit, we give a complete characterization and a zero-one law: if the first moment of the valuations is finite, then uniform convergence almost surely occurs; conversely, if the first moment is infinite, then uniform convergence almost never occurs.
Submitted 6 November, 2017; v1 submitted 23 May, 2017;
originally announced May 2017.
-
On statistical learning via the lens of compression
Authors:
Ofir David,
Shay Moran,
Amir Yehudayoff
Abstract:
This work continues the study of the relationship between sample compression schemes and statistical learning, which has been mostly investigated within the framework of binary classification. The central theme of this work is establishing equivalences between learnability and compressibility, and utilizing these equivalences in the study of statistical learning theory.
We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size.
We then consider Vapnik's general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.
Finally, we provide some applications of the compressibility-learnability equivalences:
(i) Agnostic-case learnability and realizable-case learnability are equivalent in multiclass categorization problems (in terms of sample complexity).
(ii) This equivalence between agnostic-case learnability and realizable-case learnability does not hold for general learning problems: There exists a learning problem whose loss function takes just three values, under which agnostic-case and realizable-case learnability are not equivalent.
(iii) Uniform convergence implies compression of constant size in multiclass categorization problems. Part of the argument includes an analysis of the uniform convergence rate in terms of the graph dimension, in which we improve upon previous bounds.
(iv) A dichotomy for sample compression in multiclass categorization problems: If a non-trivial compression exists then a compression of logarithmic size exists.
(v) A compactness theorem for multiclass categorization problems.
Submitted 30 December, 2016; v1 submitted 11 October, 2016;
originally announced October 2016.
-
Distributed Construction of Purely Additive Spanners
Authors:
Keren Censor-Hillel,
Telikepalli Kavitha,
Ami Paz,
Amir Yehudayoff
Abstract:
This paper studies the complexity of distributed construction of purely additive spanners in the CONGEST model. We describe algorithms for building such spanners in several cases. Because of the need to simultaneously make decisions at far apart locations, the algorithms use additional mechanisms compared to their sequential counterparts.
We complement our algorithms with a lower bound on the number of rounds required for computing pairwise spanners. The standard reductions from set-disjointness and equality seem unsuitable for this task because no specific edge needs to be removed from the graph. Instead, to obtain our lower bound, we define a new communication complexity problem that reduces to computing a sparse spanner, and prove a lower bound on its communication complexity using information theory. This technique significantly extends the current toolbox used for obtaining lower bounds for the CONGEST model, and we believe it may find additional applications.
Submitted 19 July, 2016;
originally announced July 2016.
-
On the Theoretical Capacity of Evolution Strategies to Statistically Learn the Landscape Hessian
Authors:
Ofer M. Shir,
Jonathan Roslund,
Amir Yehudayoff
Abstract:
We study the theoretical capacity to statistically learn local landscape information by Evolution Strategies (ESs). Specifically, we investigate the covariance matrix when constructed by ESs operating with the selection operator alone. We model continuous generation of candidate solutions about quadratic basins of attraction, with deterministic selection of the decision vectors that minimize the objective function values. Our goal is to rigorously show that accumulation of winning individuals carries the potential to reveal valuable information about the search landscape, e.g., as already practically utilized by derandomized ES variants. We first show that the statistically-constructed covariance matrix over such winning decision vectors shares the same eigenvectors with the Hessian matrix about the optimum. We then provide an analytic approximation of this covariance matrix for a non-elitist multi-child $(1,λ)$-strategy, which holds for a large population size $λ$. Finally, we also numerically corroborate our results.
Submitted 23 June, 2016;
originally announced June 2016.
-
Geometric stability via information theory
Authors:
David Ellis,
Ehud Friedgut,
Guy Kindler,
Amir Yehudayoff
Abstract:
The Loomis-Whitney inequality, and the more general Uniform Cover inequality, bound the volume of a body in terms of a product of the volumes of lower-dimensional projections of the body. In this paper, we prove stability versions of these inequalities, showing that when they are close to being tight, the body in question is close in symmetric difference to a 'box'. Our results are best possible up to a constant factor depending upon the dimension alone. Our approach is information theoretic.
We use our stability result for the Loomis-Whitney inequality to obtain a stability result for the edge-isoperimetric inequality in the infinite $d$-dimensional lattice. Namely, we prove that a subset of $\mathbb{Z}^d$ with small edge-boundary must be close in symmetric difference to a $d$-dimensional cube. Our bound is, again, best possible up to a constant factor depending upon $d$ alone.
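For reference, the Loomis-Whitney inequality that the abstract refers to can be written as below. This is the standard textbook form, not a statement copied from the paper; the symbols $K$, $\pi_i$, and $|\cdot|$ are notation assumed here for illustration.

```latex
% Standard form of the Loomis-Whitney inequality (notation assumed here):
% K is a measurable body in R^d, \pi_i(K) is its projection onto the
% coordinate hyperplane orthogonal to e_i, and |.| denotes Lebesgue
% measure of the appropriate dimension.
\[
  |K|^{\,d-1} \;\le\; \prod_{i=1}^{d} \bigl|\pi_i(K)\bigr| ,
\]
% with equality when K is an axis-parallel box.
```

The stability results described in the abstract concern bodies for which this inequality is close to tight.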
Submitted 16 January, 2017; v1 submitted 29 September, 2015;
originally announced October 2015.
-
An elementary exposition to topological overlap in the plane
Authors:
Amir Yehudayoff
Abstract:
The aim of this text is to provide an elementary and self-contained exposition of Gromov's argument on topological overlap (the presentation is based on Gromov's work, as well as two follow-up papers of Matoušek and Wagner, and of Dotterrer, Kaufman and Wagner). We also discuss a simple generalization in which the vertices are weighted according to some probability distribution. This allows one to use von Neumann's minimax theorem to deduce a dual statement.
Submitted 4 August, 2015;
originally announced August 2015.
-
Sign rank versus VC dimension
Authors:
Noga Alon,
Shay Moran,
Amir Yehudayoff
Abstract:
This work studies the maximum possible sign rank of $N \times N$ sign matrices with a given VC dimension $d$. For $d=1$, this maximum is three. For $d=2$, this maximum is $\tilde{Θ}(N^{1/2})$. For $d >2$, similar but slightly less accurate statements hold. The lower bounds improve over previous ones by Ben-David et al., and the upper bounds are novel.
The lower bounds are obtained by probabilistic constructions, using a theorem of Warren in real algebraic topology. The upper bounds are obtained using a result of Welzl about spanning trees with low stabbing number, and using the moment curve.
The upper bound technique is also used to: (i) provide estimates on the number of classes of a given VC dimension, and the number of maximum classes of a given VC dimension -- answering a question of Frankl from '89, and (ii) design an efficient algorithm that provides an $O(N/\log(N))$ multiplicative approximation for the sign rank.
We also observe a general connection between sign rank and spectral gaps which is based on Forster's argument. Consider the $N \times N$ adjacency matrix of a $Δ$-regular graph with a second eigenvalue of absolute value $λ$ and $Δ\leq N/2$. We show that the sign rank of the signed version of this matrix is at least $Δ/λ$. We use this connection to prove the existence of a maximum class $C\subseteq\{\pm 1\}^N$ with VC dimension $2$ and sign rank $\tilde{Θ}(N^{1/2})$. This answers a question of Ben-David et al. regarding the sign rank of large VC classes. We also describe limitations of this approach, in the spirit of the Alon-Boppana theorem.
We further describe connections to communication complexity, geometry, learning theory, and combinatorics.
Submitted 8 July, 2016; v1 submitted 26 March, 2015;
originally announced March 2015.
-
Sample compression schemes for VC classes
Authors:
Shay Moran,
Amir Yehudayoff
Abstract:
Sample compression schemes were defined by Littlestone and Warmuth (1986) as an abstraction of the structure underlying many learning algorithms. Roughly speaking, a sample compression scheme of size $k$ means that given an arbitrary list of labeled examples, one can retain only $k$ of them in a way that allows one to recover the labels of all other examples in the list. They showed that compression implies PAC learnability for binary-labeled classes, and asked whether the other direction holds. We answer their question and show that every concept class $C$ with VC dimension $d$ has a sample compression scheme of size exponential in $d$. The proof uses an approximate minimax phenomenon for binary matrices of low VC dimension, which may be of interest in the context of game theory.
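To make the definition of a sample compression scheme concrete, here is a minimal sketch of a size-1 scheme for a toy class that is not taken from the paper: threshold functions on the real line, $h_t(x) = 1$ if $x \geq t$ and $0$ otherwise. The names `compress` and `reconstruct` are hypothetical, and the sample is assumed to be consistent with some threshold.

```python
from typing import List, Optional, Tuple

Example = Tuple[float, int]  # (point, label) with label in {0, 1}


def compress(sample: List[Example]) -> Optional[float]:
    """Retain at most one example: the smallest positively labeled point.

    Assumes the sample is labeled by some threshold h_t(x) = [x >= t].
    Returns None when the sample has no positive example.
    """
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else None


def reconstruct(kept: Optional[float], x: float) -> int:
    """Recover the label of x using only the retained example."""
    if kept is None:
        # No positive example was seen, so every sample point is negative.
        return 0
    return 1 if x >= kept else 0


# Toy sample (illustrative data) consistent with any threshold in (0.9, 1.2].
sample = [(0.3, 0), (1.7, 1), (0.9, 0), (2.4, 1), (1.2, 1)]
kept = compress(sample)
recovered = [reconstruct(kept, x) for x, _ in sample]
```

Every positive point is at least the retained point and every negative point lies strictly below it, so the labels of the whole list are recovered from a single example. Thresholds have VC dimension 1; the result in the abstract addresses arbitrary classes of VC dimension $d$.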
Submitted 14 April, 2015; v1 submitted 24 March, 2015;
originally announced March 2015.
-
Teaching and compressing for low VC-dimension
Authors:
Shay Moran,
Amir Shpilka,
Avi Wigderson,
Amir Yehudayoff
Abstract:
In this work we study the quantitative relation between VC-dimension and two other basic parameters related to learning and teaching. Namely, the quality of sample compression schemes and of teaching sets for classes of low VC-dimension. Let $C$ be a binary concept class of size $m$ and VC-dimension $d$. Prior to this work, the best known upper bounds for both parameters were $\log(m)$, while the best lower bounds are linear in $d$. We present significantly better upper bounds on both as follows. Set $k = O(d 2^d \log \log |C|)$.
We show that there always exists a concept $c$ in $C$ with a teaching set (i.e. a list of $c$-labeled examples uniquely identifying $c$ in $C$) of size $k$. This problem was studied by Kuhlmann (1999). Our construction implies that the recursive teaching (RT) dimension of $C$ is at most $k$ as well. The RT-dimension was suggested by Zilles et al. and Doliwa et al. (2010). The same notion (under the name partial-ID width) was independently studied by Wigderson and Yehudayoff (2013). An upper bound on this parameter that depends only on $d$ is known just for the very simple case $d=1$, and is open even for $d=2$. We also make small progress towards this seemingly modest goal.
We further construct sample compression schemes of size $k$ for $C$, with additional information of $k \log(k)$ bits. Roughly speaking, given any list of $C$-labeled examples of arbitrary length, we can retain only $k$ labeled examples in a way that allows one to recover the labels of all other examples in the list, using additional $k\log (k)$ information bits. This problem was first suggested by Littlestone and Warmuth (1986).
Submitted 23 November, 2016; v1 submitted 22 February, 2015;
originally announced February 2015.
-
Inequalities and tail bounds for elementary symmetric polynomial with applications
Authors:
Parikshit Gopalan,
Amir Yehudayoff
Abstract:
We study the extent of independence needed to approximate the product of bounded random variables in expectation, a natural question that has applications in pseudorandomness and min-wise independent hashing.
For random variables whose absolute value is bounded by $1$, we give an error bound of the form $σ^{Ω(k)}$ where $k$ is the amount of independence and $σ^2$ is the total variance of the sum. Previously known bounds only applied in more restricted settings, and were quantitatively weaker. We use this to give a simpler and more modular analysis of a construction of min-wise independent hash functions and pseudorandom generators for combinatorial rectangles due to Gopalan et al., which also slightly improves their seed-length.
Our proof relies on a new analytic inequality for the elementary symmetric polynomials $S_k(x)$ for $x \in \mathbb{R}^n$ which we believe to be of independent interest. We show that if $|S_k(x)|,|S_{k+1}(x)|$ are small relative to $|S_{k-1}(x)|$ for some $k>0$ then $|S_\ell(x)|$ is also small for all $\ell > k$. From these, we derive tail bounds for the elementary symmetric polynomials when the inputs are only $k$-wise independent.
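As a small illustration of the objects in this abstract, the elementary symmetric polynomials $S_0(x), \dots, S_n(x)$ can be computed with the standard recurrence obtained by expanding $\prod_i (1 + x_i z)$. This is a generic sketch, not code from the paper; the function name `elementary_symmetric` is hypothetical.

```python
from typing import List, Sequence


def elementary_symmetric(xs: Sequence[float]) -> List[float]:
    """Return [S_0(x), S_1(x), ..., S_n(x)] for x = xs.

    S_k(x) is the sum of all products of k distinct coordinates of x,
    i.e. the coefficient of z^k in prod_i (1 + x_i * z).
    """
    coeffs = [1] + [0] * len(xs)
    for i, x in enumerate(xs, start=1):
        # Multiply the current polynomial by (1 + x * z). Updating from
        # high degree to low keeps coeffs[k - 1] at its old value.
        for k in range(i, 0, -1):
            coeffs[k] += x * coeffs[k - 1]
    return coeffs


# For x = (1, 2, 3): S_1 = 1+2+3, S_2 = 1*2 + 1*3 + 2*3, S_3 = 1*2*3.
values = elementary_symmetric([1, 2, 3])
```

The loop performs $O(n^2)$ arithmetic operations, compared with summing over all $\binom{n}{k}$ subsets directly.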
Submitted 10 August, 2015; v1 submitted 14 February, 2014;
originally announced February 2014.
-
Grounded Lipschitz functions on trees are typically flat
Authors:
Ron Peled,
Wojciech Samotij,
Amir Yehudayoff
Abstract:
A grounded M-Lipschitz function on a rooted d-ary tree is an integer-valued map on the vertices that changes by at most M along edges and attains the value zero on the leaves. We study the behavior of such functions, specifically, their typical value at the root v_0 of the tree. We prove that the probability that the value of a uniformly chosen random function at v_0 is more than M+t is doubly-exponentially small in t. We also show a similar bound for continuous (real-valued) grounded Lipschitz functions.
Submitted 14 May, 2013;
originally announced May 2013.
-
Lipschitz Functions on Expanders are Typically Flat
Authors:
Ron Peled,
Wojciech Samotij,
Amir Yehudayoff
Abstract:
This work studies the typical behavior of random integer-valued Lipschitz functions on expander graphs with sufficiently good expansion. We consider two families of functions: M-Lipschitz functions (functions that change by at most M along edges) and integer-homomorphisms (functions that change by exactly 1 along edges). We prove that such functions typically exhibit very small fluctuations. For instance, we show that a uniformly chosen M-Lipschitz function takes only M+1 values on most of the graph, with a double exponential decay for the probability to take other values.
Submitted 18 March, 2012;
originally announced March 2012.
-
Containing Internal Diffusion Limited Aggregation
Authors:
Hugo Duminil-Copin,
Cyrille Lucas,
Ariel Yadin,
Amir Yehudayoff
Abstract:
Internal Diffusion Limited Aggregation (IDLA) is a model that describes the growth of a random aggregate of particles from the inside out. Shellef proved that IDLA processes on supercritical percolation clusters of integer-lattices fill Euclidean balls, with high probability. In this article, we complete the picture and prove a limit-shape theorem for IDLA on such percolation clusters, by providing the corresponding upper bound.
The technique to prove upper bounds is new and robust: it only requires the existence of a "good" lower bound. Specifically, this way of proving upper bounds on IDLA clusters is more suitable for random environments than previous ways, since it does not harness harmonic measure estimates.
Submitted 2 November, 2011;
originally announced November 2011.