Search | arXiv e-print repository

Efficient and Provable Algorithms for Covariate Shift

Abstract: Covariate shift, a widely used assumption in tackling distributional shift (when training and test distributions differ), focuses on scenarios where the distribution of the labels conditioned on the feature vector is the same, but the distributions of features in the training and test data are different. Despite the significance and extensive work on covariate shift, theoretical guarantees for algorithms in this domain remain sparse. In this paper, we distill the essence of the covariate shift problem and focus on estimating the average $\mathbb{E}_{\tilde{\mathbf{x}}\sim p_{\mathrm{test}}}\mathbf{f}(\tilde{\mathbf{x}})$, of any unknown and bounded function $\mathbf{f}$, given labeled training samples $(\mathbf{x}_i, \mathbf{f}(\mathbf{x}_i))$, and unlabeled test samples $\tilde{\mathbf{x}}_i$; this is a core subroutine for several widely studied learning problems. We give several efficient algorithms, with provable sample complexity and computational guarantees. Moreover, we provide the first rigorous analysis of algorithms in this space when $\mathbf{f}$ is unrestricted, laying the groundwork for developing a solid theoretical foundation for covariate shift problems.

Submitted 21 February, 2025; originally announced February 2025.

arXiv:2501.09545 [pdf, ps, other]

Hardness of clique approximation for monotone circuits

Authors: Jarosław Błasiok, Linus Meierhöfer

Abstract: We consider a problem of approximating the size of the largest clique in a graph, with a monotone circuit. Concretely, we focus on distinguishing a random Erdős-Rényi graph $\mathcal{G}_{n,p}$, with $p=n^{-\frac{2}{α-1}}$ chosen such that with high probability it does not even have an $α$-clique, from a random clique on $β$ vertices (where $α\leq β$). Using the approximation method of Razborov, Alon and Boppana showed in 1987 that as long as $\sqrt{α} β< n^{1-δ}/\log n$, this problem requires a monotone circuit of size $n^{Ω(δ\sqrt{α})}$, implying a lower bound of $2^{\tilde{Ω}(n^{1/3})}$ for the exact version of the problem when $k\approx n^{2/3}$. Recently Cavalar, Kumar, and Rossman improved their result by showing the tight lower bound $n^{Ω(k)}$, in a limited range $k \leq n^{1/3}$, implying a comparable $2^{\tilde{Ω}(n^{1/3})}$ lower bound. We combine the ideas of Cavalar, Kumar and Rossman with the recent breakthrough results on the sunflower conjecture by Alweiss, Lovett, Wu and Zhang to show that as long as $αβ< n^{1-δ}/\log n$, any monotone circuit rejecting $\mathcal{G}_{n,p}$ while accepting a $β$-clique needs to have size at least $n^{Ω(δ^2 α)}$; this implies a stronger $2^{\tilde{Ω}(\sqrt{n})}$ lower bound for the unrestricted version of the problem. We complement this result with a construction of an explicit monotone circuit of size $O(n^{δ^2 α/2})$ which rejects $\mathcal{G}_{n,p}$, and accepts any graph containing a $β$-clique whenever $β> n^{1-δ}$.
Those two theorems explain the largest $β$-clique that can be distinguished from $\mathcal{G}_{n, 1/2}$: when $β> n / 2^{C \sqrt{\log n}}$, polynomial-size circuits can do it, while for $β< n / 2^{ω(\sqrt{\log n})}$ every circuit needs size $n^{ω(1)}$.

Submitted 16 January, 2025; originally announced January 2025.

arXiv:2404.14159 [pdf, ps, other]

Semirandom Planted Clique and the Restricted Isometry Property

Authors: Jarosław Błasiok, Rares-Darius Buhai, Pravesh K. Kothari, David Steurer

Abstract: We give a simple, greedy $O(n^{ω+0.5})=O(n^{2.872})$-time algorithm to list-decode planted cliques in a semirandom model introduced in [CSV17] (following [FK01]) that succeeds whenever the size of the planted clique is $k\geq O(\sqrt{n} \log^2 n)$. In the model, the edges touching the vertices in the planted $k$-clique are drawn independently with probability $p=1/2$ while the edges not touching the planted clique are chosen by an adversary in response to the random choices. Our result shows that the computational threshold in the semirandom setting is within an $O(\log^2 n)$ factor of the information-theoretic one [Ste17], thus resolving an open question of Steinhardt. This threshold also essentially matches the conjectured computational threshold for the well-studied special case of fully random planted clique. All previous algorithms [CSV17, MMT20, BKS23] in this model are based on rather sophisticated rounding algorithms for entropy-constrained semidefinite programming relaxations and their sum-of-squares strengthenings, and the best known guarantee is an $n^{O(1/ε)}$-time algorithm to list-decode planted cliques of size $k \geq \tilde{O}(n^{1/2+ε})$. In particular, the guarantee trivializes to quasi-polynomial time if the planted clique is of size $O(\sqrt{n} \operatorname{polylog} n)$. Our algorithm achieves an almost optimal guarantee with a surprisingly simple greedy algorithm.
The prior state-of-the-art algorithmic result above is based on a reduction to certifying bounds on the size of unbalanced bicliques in random graphs -- closely related to certifying the restricted isometry property (RIP) of certain random matrices and known to be hard in the low-degree polynomial model. Our key idea is a new approach that relies on the truth of -- but not efficient certificates for -- RIP of a new class of matrices built from the input graphs.

Submitted 9 October, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 22 pages, to appear FOCS 2024

arXiv:2309.12236 [pdf, other]

Smooth ECE: Principled Reliability Diagrams via Kernel Smoothing

Authors: Jarosław Błasiok, Preetum Nakkiran

Abstract: Calibration measures and reliability diagrams are two fundamental tools for measuring and interpreting the calibration of probabilistic predictors. Calibration measures quantify the degree of miscalibration, and reliability diagrams visualize the structure of this miscalibration. However, the most common constructions of reliability diagrams and calibration measures -- binning and ECE -- both suffer from well-known flaws (e.g. discontinuity). We show that a simple modification fixes both constructions: first smooth the observations using an RBF kernel, then compute the Expected Calibration Error (ECE) of this smoothed function. We prove that with a careful choice of bandwidth, this method yields a calibration measure that is well-behaved in the sense of (Błasiok, Gopalan, Hu, and Nakkiran 2023a) -- a consistent calibration measure. We call this measure the SmoothECE. Moreover, the reliability diagram obtained from this smoothed function visually encodes the SmoothECE, just as binned reliability diagrams encode the BinnedECE. We also provide a Python package with simple, hyperparameter-free methods for measuring and plotting calibration: `pip install relplot`.

Submitted 21 September, 2023; originally announced September 2023.

Comments: Code at: https://github.com/apple/ml-calibration
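
The smooth-then-measure recipe described in the abstract above can be illustrated with a short sketch. This is not the `relplot` API and not the authors' exact estimator: the function name `smooth_ece_sketch`, the fixed `bandwidth`, the plain (unreflected) Gaussian kernel, and the uniform grid are all assumptions made for illustration, and the paper's principled bandwidth choice is not reproduced here.

```python
import numpy as np


def smooth_ece_sketch(preds, labels, bandwidth=0.05, grid_size=512):
    """Kernel-smoothed calibration error (illustrative, fixed bandwidth).

    preds  : predicted probabilities in [0, 1]
    labels : binary outcomes (0 or 1)
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)

    # Gaussian (RBF) kernel density between every grid point and every prediction.
    scaled = (grid[:, None] - preds[None, :]) / bandwidth
    kernel = np.exp(-0.5 * scaled**2) / (bandwidth * np.sqrt(2.0 * np.pi))

    # Smoothed residual r(t) = (1/n) * sum_i K(t, f_i) * (y_i - f_i).
    smoothed_residual = kernel @ (labels - preds) / len(preds)

    # Riemann-sum approximation of the integral of |r(t)| over [0, 1].
    step = grid[1] - grid[0]
    return float(np.sum(np.abs(smoothed_residual)) * step)
```

With a fixed bandwidth the result depends on that hyperparameter, which is exactly what the hyperparameter-free package mentioned in the abstract is meant to avoid; the sketch only shows the order of operations (smooth first, then integrate the absolute residual).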

arXiv:2305.18764 [pdf, other]

When Does Optimizing a Proper Loss Yield Calibration?

Authors: Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Preetum Nakkiran

Abstract: Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties; the intuition being that for such losses, the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors, that are unlikely to contain the ground truth. Under what circumstances does optimizing proper loss over a restricted family yield calibrated models? What precise calibration guarantees does it give? In this work, we provide a rigorous answer to these questions. We replace the global optimality with a local optimality condition stipulating that the (proper) loss of the predictor cannot be reduced much by post-processing its predictions with a certain family of Lipschitz functions. We show that any predictor with this local optimality satisfies smooth calibration as defined in Kakade-Foster (2008), Błasiok et al. (2023). Local optimality is plausibly satisfied by well-trained DNNs, which suggests an explanation for why they are calibrated from proper loss minimization alone. Finally, we show that the connection between local optimality and calibration error goes both ways: nearly calibrated predictors are also nearly locally optimal.

Submitted 8 December, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: In NeurIPS 2023. Selected for spotlight presentation

arXiv:2304.09424 [pdf, other]

Loss Minimization Yields Multicalibration for Large Neural Networks

Authors: Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Adam Tauman Kalai, Preetum Nakkiran

Abstract: Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups. Multicalibration is known to be a goal distinct from loss minimization, even for simple predictors such as linear functions. In this work, we consider the setting where the protected groups can be represented by neural networks of size $k$, and the predictors are neural networks of size $n > k$. We show that minimizing the squared loss over all neural nets of size $n$ implies multicalibration for all but a bounded number of unlucky values of $n$. We also give evidence that our bound on the number of unlucky values is tight, given our proof technique. Previously, results of the flavor that loss minimization yields multicalibration were known only for predictors that were near the ground truth, hence were rather limited in applicability. Unlike these, our results rely on the expressivity of neural nets and utilize the representation of the predictor.

Submitted 7 December, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

Comments: In ITCS 2024

arXiv:2302.11476 [pdf, ps, other]

Matrix Multiplication and Number On the Forehead Communication

Authors: Josh Alman, Jarosław Błasiok

Abstract: Three-player Number On the Forehead communication may be thought of as a three-player Number In the Hand promise model, in which each player is given the inputs that are supposedly on the other two players' heads, and promised that they are consistent with the inputs of the other players. The set of all allowed inputs under this promise may be thought of as an order-3 tensor. We surprisingly observe that this tensor is exactly the matrix multiplication tensor, which is widely studied in the design of fast matrix multiplication algorithms. Using this connection, we prove a number of results about both Number On the Forehead communication and matrix multiplication, each by using known results or techniques about the other. For example, we show how the Laser method, a key technique used to design the best matrix multiplication algorithms, can also be used to design communication protocols for a variety of problems. We also show how known lower bounds for Number On the Forehead communication can be used to bound properties of the matrix multiplication tensor such as its zeroing out subrank. Finally, we substantially generalize known methods based on slice-rank for studying communication, and show how they directly relate to the matrix multiplication exponent $ω$.

Submitted 22 February, 2023; originally announced February 2023.

arXiv:2211.16886 [pdf, other]

A Unifying Theory of Distance from Calibration

Authors: Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Preetum Nakkiran

Abstract: We study the fundamental question of how to define and measure the distance from calibration for probabilistic predictors. While the notion of perfect calibration is well-understood, there is no consensus on how to quantify the distance from perfect calibration. Numerous calibration measures have been proposed in the literature, but it is unclear how they compare to each other, and many popular measures such as Expected Calibration Error (ECE) fail to satisfy basic properties like continuity. We present a rigorous framework for analyzing calibration measures, inspired by the literature on property testing. We propose a ground-truth notion of distance from calibration: the $\ell_1$ distance to the nearest perfectly calibrated predictor. We define a consistent calibration measure as one that is polynomially related to this distance. Applying our framework, we identify three calibration measures that are consistent and can be estimated efficiently: smooth calibration, interval calibration, and Laplace kernel calibration. The former two give quadratic approximations to the ground truth distance, which we show is information-theoretically optimal in a natural model for measuring calibration which we term the prediction-only access model. Our work thus establishes fundamental lower and upper bounds on measuring the distance to calibration, and also provides theoretical justification for preferring certain metrics (like Laplace kernel calibration) in practice.

Submitted 31 March, 2023; v1 submitted 30 November, 2022; originally announced November 2022.

Comments: In STOC 2023

arXiv:2211.13473 [pdf, ps, other]

Communication Complexity of Inner Product in Symmetric Normed Spaces

Authors: Alexandr Andoni, Jarosław Błasiok, Arnold Filtser

Abstract: We introduce and study the communication complexity of computing the inner product of two vectors, where the input is restricted w.r.t. a norm $N$ on the space $\mathbb{R}^n$. Here, Alice and Bob hold two vectors $v,u$ such that $\|v\|_N\le 1$ and $\|u\|_{N^*}\le 1$, where $N^*$ is the dual norm. They want to compute their inner product $\langle v,u \rangle$ up to an $\varepsilon$ additive term. The problem is denoted by $\mathrm{IP}_N$. We systematically study $\mathrm{IP}_N$, showing the following results:
- For any symmetric norm $N$, given $\|v\|_N\le 1$ and $\|u\|_{N^*}\le 1$ there is a randomized protocol for $\mathrm{IP}_N$ using $\tilde{\mathcal{O}}(\varepsilon^{-6} \log n)$ bits -- we will denote this by $\mathcal{R}_{\varepsilon,1/3}(\mathrm{IP}_{N}) \leq \tilde{\mathcal{O}}(\varepsilon^{-6} \log n)$.
- One way communication complexity $\overrightarrow{\mathcal{R}}(\mathrm{IP}_{\ell_p})\leq\mathcal{O}(\varepsilon^{-\max(2,p)}\cdot \log\frac n\varepsilon)$, and a nearly matching lower bound $\overrightarrow{\mathcal{R}}(\mathrm{IP}_{\ell_p}) \geq Ω(\varepsilon^{-\max(2,p)})$ for $\varepsilon^{-\max(2,p)} \ll n$.
- One way communication complexity $\overrightarrow{\mathcal{R}}(N)$ for a symmetric norm $N$ is governed by embeddings $\ell_\infty^k$ into $N$. Specifically, while a small distortion embedding easily implies a lower bound $Ω(k)$, we show that, conversely, non-existence of such an embedding implies protocol with communication $k^{\mathcal{O}(\log \log k)} \log^2 n$.
- For arbitrary origin symmetric convex polytope $P$, we show $\mathcal{R}(\mathrm{IP}_{N}) \le\mathcal{O}(\varepsilon^{-2} \log \mathrm{xc}(P))$, where $N$ is the unique norm for which $P$ is a unit ball, and $\mathrm{xc}(P)$ is the extension complexity of $P$.

Submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted to ITCS 2023

arXiv:2204.03230 [pdf, other]

What You See is What You Get: Principled Deep Learning via Distributional Generalization

Authors: Bogdan Kulynych, Yao-Yuan Yang, Yaodong Yu, Jarosław Błasiok, Preetum Nakkiran

Abstract: Having similar behavior at training time and test time $-$ what we call a "What You See Is What You Get" (WYSIWYG) property $-$ is desirable in machine learning. Models trained with standard stochastic gradient descent (SGD), however, do not necessarily have this property, as their complex behaviors such as robustness or subgroup performance can differ drastically between training and test time. In contrast, we show that Differentially-Private (DP) training provably ensures the high-level WYSIWYG property, which we quantify using a notion of distributional generalization. Applying this connection, we introduce new conceptual tools for designing deep-learning methods by reducing generalization concerns to optimization ones: to mitigate unwanted behavior at test time, it is provably sufficient to mitigate this behavior on the training data. By applying this novel design principle, which bypasses "pathologies" of SGD, we construct simple algorithms that are competitive with SOTA in several distributional-robustness applications, significantly improve the privacy vs. disparate impact trade-off of DP-SGD, and mitigate robust overfitting in adversarial training. Finally, we also improve on theoretical bounds relating DP, stability, and distributional generalization.

Submitted 17 October, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: First two authors contributed equally. To appear in NeurIPS 2022

arXiv:2107.10797 [pdf, other]

Fourier growth of structured $\mathbb{F}_2$-polynomials and applications

Authors: Jarosław Błasiok, Peter Ivanov, Yaonan Jin, Chin Ho Lee, Rocco A. Servedio, Emanuele Viola

Abstract: We analyze the Fourier growth, i.e. the $L_1$ Fourier weight at level $k$ (denoted $L_{1,k}$), of various well-studied classes of "structured" $\mathbb{F}_2$-polynomials. This study is motivated by applications in pseudorandomness, in particular recent results and conjectures due to [CHHL19,CHLT19,CGLSS20] which show that upper bounds on Fourier growth (even at level $k=2$) give unconditional pseudorandom generators. Our main structural results on Fourier growth are as follows:
- We show that any symmetric degree-$d$ $\mathbb{F}_2$-polynomial $p$ has $L_{1,k}(p) \le \Pr[p=1] \cdot O(d)^k$, and this is tight for any constant $k$. This quadratically strengthens an earlier bound that was implicit in [RSV13].
- We show that any read-$Δ$ degree-$d$ $\mathbb{F}_2$-polynomial $p$ has $L_{1,k}(p) \le \Pr[p=1] \cdot (k Δd)^{O(k)}$.
- We establish a composition theorem which gives $L_{1,k}$ bounds on disjoint compositions of functions that are closed under restrictions and admit $L_{1,k}$ bounds.
Finally, we apply the above structural results to obtain new unconditional pseudorandom generators and new correlation bounds for various classes of $\mathbb{F}_2$-polynomials.

Submitted 11 October, 2024; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Corrected a mistake in Lemma 27 in the previous version of the paper

arXiv:1903.12135 [pdf, other]

An Improved Lower Bound for Sparse Reconstruction from Subsampled Walsh Matrices

Authors: Jarosław Błasiok, Patrick Lopatto, Kyle Luh, Jake Marcinek, Shravas Rao

Abstract: We give a short argument that yields a new lower bound on the number of subsampled rows from a bounded, orthonormal matrix necessary to form a matrix with the restricted isometry property. We show that a matrix formed by uniformly subsampling rows of an $N \times N$ Walsh matrix contains a $K$-sparse vector in the kernel, unless the number of subsampled rows is $Ω(K \log K \log (N/K))$ -- our lower bound applies whenever $\min(K, N/K) > \log^C N$. Containing a sparse vector in the kernel precludes not only the restricted isometry property, but more generally the application of those matrices for uniform sparse recovery.

Submitted 9 May, 2023; v1 submitted 28 March, 2019; originally announced March 2019.

Comments: Revised version. Published in Discrete Analysis

arXiv:1811.03763 [pdf, ps, other]

Towards Instance-Optimal Private Query Release

Authors: Jaroslaw Blasiok, Mark Bun, Aleksandar Nikolov, Thomas Steinke

Abstract: We study efficient mechanisms for the query release problem in differential privacy: given a workload of $m$ statistical queries, output approximate answers to the queries while satisfying the constraints of differential privacy. In particular, we are interested in mechanisms that optimally adapt to the given workload. Building on the projection mechanism of Nikolov, Talwar, and Zhang, and using the ideas behind Dudley's chaining inequality, we propose new efficient algorithms for the query release problem, and prove that they achieve optimal sample complexity for the given workload (up to constant factors, in certain parameter regimes) with respect to the class of mechanisms that satisfy concentrated differential privacy. We also give variants of our algorithms that satisfy local differential privacy, and prove that they also achieve optimal sample complexity among all local sequentially interactive private mechanisms.

Submitted 8 November, 2018; originally announced November 2018.

Comments: To appear in SODA 2019

arXiv:1810.04298 [pdf, ps, other]

Polar Codes with exponentially small error at finite block length

Authors: Jarosław Błasiok, Venkatesan Guruswami, Madhu Sudan

Abstract: We show that the entire class of polar codes (up to a natural necessary condition) converge to capacity at block lengths polynomial in the gap to capacity, while simultaneously achieving failure probabilities that are exponentially small in the block length (i.e., decoding fails with probability $\exp(-N^{Ω(1)})$ for codes of length $N$). Previously this combination was known only for one specific family within the class of polar codes, whereas we establish this whenever the polar code exhibits a condition necessary for any polarization. Our results adapt and strengthen a local analysis of polar codes due to the authors with Nakkiran and Rudra [Proc. STOC 2018]. Their analysis related the time-local behavior of a martingale to its global convergence, and this allowed them to prove that the broad class of polar codes converge to capacity at polynomial block lengths. Their analysis easily adapts to show exponentially small failure probabilities, provided the associated martingale, the ``Arikan martingale'', exhibits a corresponding strong local effect. The main contribution of this work is a much stronger local analysis of the Arikan martingale. This leads to the general result claimed above. In addition to our general result, we also show, for the first time, polar codes that achieve failure probability $\exp(-N^β)$ for any $β< 1$ while converging to capacity at block length polynomial in the gap to capacity.
Finally we also show that the ``local'' approach can be combined with any analysis of failure probability of an arbitrary polar code to get essentially the same failure probability while achieving block length polynomial in the gap to capacity.

Submitted 9 October, 2018; originally announced October 2018.

Comments: 17 pages, Appeared in RANDOM'18. arXiv admin note: substantial text overlap with arXiv:1802.02718

arXiv:1809.05596 [pdf, ps, other]

The Generic Holdout: Preventing False-Discoveries in Adaptive Data Science

Authors: Preetum Nakkiran, Jarosław Błasiok

Abstract: Adaptive data analysis has posed a challenge to science due to its ability to generate false hypotheses on moderately large data sets. In general, with non-adaptive data analyses (where queries to the data are generated without being influenced by answers to previous queries) a data set containing $n$ samples may support exponentially many queries in $n$. This number reduces to linearly many under naive adaptive data analysis, and even sophisticated remedies such as the Reusable Holdout (Dwork et al. 2015) only allow quadratically many queries in $n$. In this work, we propose a new framework for adaptive science which exponentially improves on this number of queries under a restricted yet scientifically relevant setting, where the goal of the scientist is to find a single (or a few) true hypotheses about the universe based on the samples. Such a setting may describe the search for predictive factors of some disease based on medical data, where the analyst may wish to try a number of predictive models until a satisfactory one is found. Our solution, the Generic Holdout, involves two simple ingredients: (1) a partitioning of the data into an exploration set and a holdout set and (2) a limited exposure strategy for the holdout set. An analyst is free to use the exploration set arbitrarily, but when testing hypotheses against the holdout set, the analyst only learns the answer to the question: "Is the given hypothesis true (empirically) on the holdout set?" -- and no more information, such as "how well" the hypothesis fits the holdout set.
The resulting scheme is immediate to analyze, but despite its simplicity we do not believe our method is obvious, as evidenced by the many violations in practice. Our proposal can be seen as an alternative to pre-registration, and allows researchers to get the benefits of adaptive data analysis without the problems of adaptivity.

Submitted 14 September, 2018; originally announced September 2018.
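
The two ingredients named in the abstract above (a data partition and a holdout that reveals only a yes/no answer) might be sketched as follows. All names here (`HoldoutOracle`, `split_samples`, `is_supported`, `score_fn`, `threshold`) are hypothetical, and modelling a hypothesis as a per-sample score whose mean must reach a threshold is an assumption of this sketch, not a detail taken from the paper.

```python
import random


class HoldoutOracle:
    """Wraps the holdout set and exposes only a single-bit answer."""

    def __init__(self, holdout, threshold):
        self._holdout = list(holdout)   # never handed back to the analyst
        self._threshold = threshold     # assumed acceptance level

    def is_supported(self, score_fn):
        """Return True/False only; the empirical score itself stays hidden."""
        mean_score = sum(score_fn(x) for x in self._holdout) / len(self._holdout)
        return mean_score >= self._threshold


def split_samples(samples, threshold, holdout_fraction=0.5, seed=0):
    """Partition the data into an exploration list and a holdout oracle."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - holdout_fraction))
    exploration = shuffled[:cut]        # free for arbitrary adaptive use
    return exploration, HoldoutOracle(shuffled[cut:], threshold)
```

The analyst may inspect `exploration` in any way, but can interact with the held-out half only through `is_supported`, which mirrors the "limited exposure" idea: no information about how well a hypothesis fits is released.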

arXiv:1804.01642 [pdf, ps, other]

Optimal streaming and tracking distinct elements with high probability

Authors: Jarosław Błasiok

Abstract: The distinct elements problem is one of the fundamental problems in streaming algorithms --- given a stream of integers in the range $\{1,\ldots,n\}$, we wish to provide a $(1+\varepsilon)$ approximation to the number of distinct elements in the input. After a long line of research an optimal solution for this problem with constant probability of success, using $\mathcal{O}(\frac{1}{\varepsilon^2}+\log n)$ bits of space, was given by Kane, Nelson and Woodruff in 2010. The standard approach used in order to achieve low failure probability $δ$ is to take the median of $\log δ^{-1}$ parallel repetitions of the original algorithm. We show that such a multiplicative space blow-up is unnecessary: we provide an optimal algorithm using $\mathcal{O}(\frac{\log δ^{-1}}{\varepsilon^2} + \log n)$ bits of space --- matching known lower bounds for this problem. That is, the $\logδ^{-1}$ factor does not multiply the $\log n$ term. This settles completely the space complexity of the distinct elements problem with respect to all standard parameters. We consider also the \emph{strong tracking} (or \emph{continuous monitoring}) variant of the distinct elements problem, where we want an algorithm which provides an approximation of the number of distinct elements seen so far, at all times of the stream. We show that this variant can be solved using $\mathcal{O}(\frac{\log \log n + \log δ^{-1}}{\varepsilon^2} + \log n)$ bits of space, which we show to be optimal.

Submitted 4 January, 2019; v1 submitted 4 April, 2018; originally announced April 2018.

Comments: A preliminary version of this paper appeared in SODA 2018

arXiv:1802.02718 [pdf, other]

doi 10.1145/3491390

General Strong Polarization

Authors: Jarosław Błasiok, Venkatesan Guruswami, Preetum Nakkiran, Atri Rudra, Madhu Sudan

Abstract: Arikan's exciting discovery of polar codes has provided an altogether new way to efficiently achieve Shannon capacity. Given a (constant-sized) invertible matrix $M$, a family of polar codes can be associated with this matrix and its ability to approach capacity follows from the {\em polarization} of an associated $[0,1]$-bounded martingale, namely its convergence in the limit to either $0$ or $1$. Arikan showed polarization of the martingale associated with the matrix $G_2 = \left(\begin{matrix} 1 & 0 \\ 1 & 1\end{matrix}\right)$ to get capacity achieving codes. His analysis was later extended to all matrices $M$ that satisfy an obvious necessary condition for polarization. While Arikan's theorem does not guarantee that the codes achieve capacity at small blocklengths, it turns out that a "strong" analysis of the polarization of the underlying martingale would lead to such constructions. Indeed for the martingale associated with $G_2$ such a strong polarization was shown in two independent works ([Guruswami and Xia, IEEE IT '15] and [Hassani et al., IEEE IT '14]), resolving a major theoretical challenge of the efficient attainment of Shannon capacity. In this work we extend the result above to cover martingales associated with all matrices that satisfy the necessary condition for (weak) polarization. In addition to being vastly more general, our proofs of strong polarization are also simpler and modular.
Specifically, our result shows strong polarization over all prime fields and leads to efficient capacity-achieving codes for arbitrary symmetric memoryless channels. We show how to use our analyses to achieve exponentially small error probabilities at lengths inverse polynomial in the gap to capacity. Indeed we show that we can essentially match any error probability with lengths that are only inverse polynomial in the gap to capacity.

Submitted 8 May, 2022; v1 submitted 8 February, 2018; originally announced February 2018.

Comments: 73 pages, 2 figures. The final version appeared in JACM. This paper combines results presented in preliminary form at STOC 2018 and RANDOM 2018

Journal ref: Jarosław Błasiok, Venkatesan Guruswami, Preetum Nakkiran, Atri Rudra, and Madhu Sudan. 2022. General Strong Polarization. J. ACM 69, 2, Article 11 (April 2022), 67 pages

arXiv:1709.07308 [pdf, other]

Predicting Positive and Negative Links with Noisy Queries: Theory & Practice

Authors: Charalampos E. Tsourakakis, Michael Mitzenmacher, Kasper Green Larsen, Jarosław Błasiok, Ben Lawson, Preetum Nakkiran, Vasileios Nakos

Abstract: Social networks involve both positive and negative relationships, which can be captured in signed graphs. The {\em edge sign prediction problem} aims to predict whether an interaction between a pair of nodes will be positive or negative. We provide theoretical results for this problem that motivate natural improvements to recent heuristics. The edge sign prediction problem is related to correlation clustering; a positive relationship means being in the same cluster. We consider the following model for two clusters: we are allowed to query any pair of nodes whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability $0<q<\frac{1}{2}$. Let $δ=1-2q$ be the bias. We provide an algorithm that recovers all signs correctly with high probability in the presence of noise with $O(\frac{n\log n}{δ^2}+\frac{\log^2 n}{δ^6})$ queries. This is the best known result for this problem for all but tiny $δ$, improving on the recent work of Mazumdar and Saha \cite{mazumdar2017clustering}. We also provide an algorithm that performs $O(\frac{n\log n}{δ^4})$ queries, and uses breadth first search as its main algorithmic primitive. While both the running time and the number of queries for this algorithm are sub-optimal, our result relies on novel theoretical techniques, and naturally suggests the use of edge-disjoint paths as a feature for predicting signs in online social networks.
Correspondingly, we experiment with using edge disjoint $s-t$ paths of short length as a feature for predicting the sign of edge $(s,t)$ in real-world signed networks. Empirical findings suggest that the use of such paths improves the classification accuracy, especially for pairs of nodes with no common neighbors.

Submitted 6 December, 2020; v1 submitted 19 September, 2017; originally announced September 2017.

Comments: arXiv admin note: text overlap with arXiv:1609.00750

arXiv:1704.06710 [pdf, ps, other]

Continuous monitoring of $\ell_p$ norms in data streams

Authors: Jarosław Błasiok, Jian Ding, Jelani Nelson

Abstract: In insertion-only streaming, one sees a sequence of indices $a_1, a_2, \ldots, a_m\in [n]$. The stream defines a sequence of $m$ frequency vectors $x^{(1)},\ldots,x^{(m)}\in\mathbb{R}^n$ with $(x^{(t)})_i = |\{j : j\in[t], a_j = i\}|$. That is, $x^{(t)}$ is the frequency vector after seeing the first $t$ items in the stream. Much work in the streaming literature focuses on estimating some function $f(x^{(m)})$. Many applications though require obtaining estimates at time $t$ of $f(x^{(t)})$, for every $t\in[m]$. Naively this guarantee is obtained by devising an algorithm with failure probability $\ll 1/m$, then performing a union bound over all stream updates to guarantee that all $m$ estimates are simultaneously accurate with good probability. When $f(x)$ is some $\ell_p$ norm of $x$, recent works have shown that this union bound is wasteful and better space complexity is possible for the continuous monitoring problem, with the strongest known results being for $p=2$ [HTY14, BCIW16, BCINWW17]. In this work, we improve the state of the art for all $0<p<2$, which we obtain via a novel analysis of Indyk's $p$-stable sketch [Indyk06].

Submitted 8 November, 2017; v1 submitted 21 April, 2017; originally announced April 2017.

Comments: v2: Lemma 10 proof now correctly bounds q <= (1/eps)^{O(1/p)} instead of the previously erroneous 1/eps^4. All stated results still hold for p in (0,2] bounded away from zero

arXiv:1609.05388 [pdf, other]

ADAGIO: Fast Data-aware Near-Isometric Linear Embeddings

Authors: Jarosław Błasiok, Charalampos E. Tsourakakis

Abstract: Many important applications, including signal reconstruction, parameter estimation, and signal processing in a compressed domain, rely on a low-dimensional representation of the dataset that preserves {\em all} pairwise distances between the data points and leverages the inherent geometric structure that is typically present. Recently Hegde, Sankaranarayanan, Yin and Baraniuk \cite{hedge2015} proposed the first data-aware near-isometric linear embedding which achieves the best of both worlds. However, their method NuMax does not scale to large-scale datasets. Our main contribution is a simple, data-aware, near-isometric linear dimensionality reduction method which significantly outperforms a state-of-the-art method \cite{hedge2015} with respect to scalability while achieving high quality near-isometries. Furthermore, our method comes with strong worst-case theoretical guarantees that allow us to guarantee the quality of the obtained near-isometry. We verify experimentally the efficiency of our method on numerous real-world datasets, where we find that our method ($<$10 secs) is more than 3\,000$\times$ faster than the state-of-the-art method \cite{hedge2015} ($>$9 hours) on medium scale datasets with 60\,000 data points in 784 dimensions. Finally, we use our method as a preprocessing step to increase the computational efficiency of a classification application and for speeding up approximate nearest neighbor queries.

Submitted 17 September, 2016; originally announced September 2016.

Comments: ICDM 2016

arXiv:1602.05719 [pdf, ps, other]

An improved analysis of the ER-SpUD dictionary learning algorithm

Authors: Jarosław Błasiok, Jelani Nelson

Abstract: In "dictionary learning" we observe $Y = AX + E$ for some $Y\in\mathbb{R}^{n\times p}$, $A \in\mathbb{R}^{m\times n}$, and $X\in\mathbb{R}^{m\times p}$. The matrix $Y$ is observed, and $A, X, E$ are unknown. Here $E$ is "noise" of small norm, and $X$ is column-wise sparse. The matrix $A$ is referred to as a {\em dictionary}, and its columns as {\em atoms}. Then, given some small number $p$ of samples, i.e.\ columns of $Y$, the goal is to learn the dictionary $A$ up to small error, as well as $X$. The motivation is that in many applications data is expected to be sparse when represented by atoms in the "right" dictionary $A$ (e.g.\ images in the Haar wavelet basis), and the goal is to learn $A$ from the data to then use it for other applications. Recently, [SWW12] proposed the dictionary learning algorithm ER-SpUD with provable guarantees when $E = 0$ and $m = n$. They showed if $X$ has independent entries with an expected $s$ non-zeroes per column for $1 \lesssim s \lesssim \sqrt{n}$, and with non-zero entries being subgaussian, then for $p\gtrsim n^2\log^2 n$ with high probability ER-SpUD outputs matrices $A', X'$ which equal $A, X$ up to permuting and scaling columns (resp.\ rows) of $A$ (resp.\ $X$). They conjectured $p\gtrsim n\log n$ suffices, which they showed was information theoretically necessary for {\em any} algorithm to succeed when $s \simeq 1$. Significant progress was later obtained in [LV15]. We show that for a slight variant of ER-SpUD, $p\gtrsim n\log(n/δ)$ samples suffice for successful recovery with probability $1-δ$.
We also show that for the unmodified ER-SpUD, $p\gtrsim n^{1.99}$ samples are required even to learn $A, X$ with polynomially small success probability. This resolves the main conjecture of [SWW12], and contradicts the main result of [LV15], which claimed that $p\gtrsim n\log^4 n$ guarantees success whp.

Submitted 18 February, 2016; originally announced February 2016.

ACM Class: I.2.6; F.2.0

arXiv:1511.01111 [pdf, other]

Streaming Symmetric Norms via Measure Concentration

Authors: Jaroslaw Blasiok, Vladimir Braverman, Stephen R. Chestnut, Robert Krauthgamer, Lin F. Yang

Abstract: We characterize the streaming space complexity of every symmetric norm $l$ (a norm on $\mathbb{R}^n$ invariant under sign-flips and coordinate-permutations), by relating this space complexity to the measure-concentration characteristics of $l$. Specifically, we provide nearly matching upper and lower bounds on the space complexity of calculating a $(1\pmε)$-approximation to the norm of the stream, for every $0<ε\leq 1/2$. (The bounds match up to $poly(ε^{-1} \log n)$ factors.) We further extend those bounds to any large approximation ratio $D\geq 1.1$, showing that the decrease in space complexity is proportional to $D^2$, and that this factor is the best possible. All of the bounds depend on the median of $l(x)$ when $x$ is drawn uniformly from the $l_2$ unit sphere. The same median governs many phenomena in high-dimensional spaces, such as large-deviation bounds and the critical dimension in Dvoretzky's Theorem. The family of symmetric norms contains several well-studied norms, such as all $l_p$~norms, and indeed we provide a new explanation for the disparity in space complexity between $p\le 2$ and $p>2$. In addition, we apply our general results to easily derive bounds for several norms that were not studied before in the streaming model, including the top-$k$ norm and the $k$-support norm, which was recently employed for machine learning tasks. Overall, these results make progress on two outstanding problems in the area of sublinear algorithms (Problems 5 and 30 in~\url{http://sublinear.info}).

Submitted 26 June, 2017; v1 submitted 3 November, 2015; originally announced November 2015.

Comments: published in STOC 2017

arXiv:1510.07135 [pdf, ps, other]

Induced minors and well-quasi-ordering

Authors: Jarosław Błasiok, Marcin Kamiński, Jean-Florent Raymond, Théophile Trunck

Abstract: A graph $H$ is an induced minor of a graph $G$ if it can be obtained from an induced subgraph of $G$ by contracting edges. Otherwise, $G$ is said to be $H$-induced minor-free. Robin Thomas showed that $K_4$-induced minor-free graphs are well-quasi-ordered by induced minors [Graphs without $K_4$ and well-quasi-ordering, Journal of Combinatorial Theory, Series B, 38(3):240 -- 247, 1985]. We provide a dichotomy theorem for $H$-induced minor-free graphs and show that the class of $H$-induced minor-free graphs is well-quasi-ordered by the induced minor relation if and only if $H$ is an induced minor of the gem (the path on 4 vertices plus a dominating vertex) or of the graph obtained by adding a vertex of degree 2 to the complete graph on 4 vertices. To this end we proved two decomposition theorems which are of independent interest. Similar dichotomy results were previously given for subgraphs by Guoli Ding in [Subgraphs and well-quasi-ordering, Journal of Graph Theory, 16(5):489--502, 1992] and for induced subgraphs by Peter Damaschke in [Induced subgraphs and well-quasi-ordering, Journal of Graph Theory, 14(4):427--435, 1990].

Submitted 22 January, 2018; v1 submitted 24 October, 2015; originally announced October 2015.

MSC Class: 05C; 06A07

ACM Class: G.2.2

arXiv:1304.5849 [pdf, other]

Chain minors are FPT

Authors: Jaroslaw Blasiok, Marcin Kaminski

Abstract: Given two finite posets P and Q, P is a chain minor of Q if there exists a partial function f from the elements of Q to the elements of P such that for every chain in P there is a chain C_Q in Q with the property that f restricted to C_Q is an isomorphism of chains. We give an algorithm to decide whether a poset P is a chain minor of a poset Q that runs in time O(|Q| log |Q|) for every fixed poset P. This solves an open problem from the monograph by Downey and Fellows [Parameterized Complexity, 1999] who asked whether the problem was fixed parameter tractable.

Submitted 22 April, 2013; originally announced April 2013.

Showing 1–24 of 24 results for author: Błasiok, J