
TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm

Authors: Yi Hao, Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar

Abstract: Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it can be shown that $c_{t,d}\ge 2$ for all other $t$ and $d$. Yet current computationally efficient algorithms show only $c_{t,1}\le 2.25$ and the bound rises quickly to $c_{t,d}\le 3$ for $d\ge 9$. We derive a near-linear-time and essentially sample-optimal estimator that establishes $c_{t,d}=2$ for all $(t,d)\ne(1,0)$. Additionally, for many practical distributions, the lowest approximation distance is achieved by polynomials with vastly varying number of pieces. We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation. Experiments combining the two techniques confirm improved performance over existing methodologies.

Submitted 17 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: 19 pages, 12 figures

arXiv:2202.05453 [pdf, ps, other]

Robust estimation algorithms don't need to know the corruption level

Authors: Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar

Abstract: Real data are rarely pure. Hence the past half-century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, their vast majority approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This brief note abstracts the complex and pervasive robustness problem into a simple geometric puzzle. It then applies the puzzle's solution to derive a universal meta technique that converts any robust estimation algorithm requiring a tight corruption-level upper bound to achieve its optimal accuracy into one achieving essentially the same accuracy without using any upper bounds.

Submitted 11 February, 2022; originally announced February 2022.

arXiv:2002.09589 [pdf, other]

SURF: A Simple, Universal, Robust, Fast Distribution Learning Algorithm

Authors: Yi Hao, Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar

Abstract: Sample- and computationally-efficient distribution estimation is a fundamental tenet in statistics and machine learning. We present SURF, an algorithm for approximating distributions by piecewise polynomials. SURF is: simple, replacing prior complex optimization techniques by straightforward empirical probability approximation of each potential polynomial piece through simple empirical-probability interpolation, and using plain divide-and-conquer to merge the pieces; universal, as well-known polynomial-approximation results imply that it accurately approximates a large class of common distributions; robust to distribution mis-specification as for any degree $d \le 8$, it estimates any distribution to an $\ell_1$ distance $< 3$ times that of the nearest degree-$d$ piecewise polynomial, improving known factor upper bounds of 3 for single polynomials and 15 for polynomials with arbitrarily many pieces; fast, using optimal sample complexity, running in near sample-linear time, and if given sorted samples it may be parallelized to run in sub-linear time. In experiments, SURF outperforms state-of-the-art algorithms.

Submitted 11 February, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

Comments: 27 pages, 9 figures, 3 tables

arXiv:1605.01348 [pdf, other]

Private Coded Caching

Authors: Vaishakh Ravindrakumar, Parthasarathi Panda, Nikhil Karamchandani, Vinod Prabhakaran

Abstract: Recent work by Maddah-Ali and Niesen introduced coded caching which demonstrated the benefits of joint design of storage and transmission policies in content delivery networks. They studied a setup where a server communicates with a set of users, each equipped with a local cache, over a shared error-free link and proposed an order-optimal caching and delivery scheme. In this paper, we introduce the problem of secretive coded caching where we impose the additional constraint that a user should not be able to learn anything, from either the content stored in its cache or the server transmissions, about a file it did not request. We propose a feasible scheme for this setting and demonstrate its order-optimality with respect to information-theoretic lower bounds.

Submitted 11 October, 2017; v1 submitted 4 May, 2016; originally announced May 2016.

Comments: To appear in IEEE Transactions on Information Forensics and Security

Showing 1–4 of 4 results for author: Ravindrakumar, V