-
TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm
Authors:
Yi Hao,
Ayush Jain,
Alon Orlitsky,
Vaishakh Ravindrakumar
Abstract:
Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it c…
▽ More
Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it can be shown that $c_{t,d}\ge 2$ for all other $t$ and $d$. Yet current computationally efficient algorithms show only $c_{t,1}\le 2.25$ and the bound rises quickly to $c_{t,d}\le 3$ for $d\ge 9$. We derive a near-linear-time and essentially sample-optimal estimator that establishes $c_{t,d}=2$ for all $(t,d)\ne(1,0)$. Additionally, for many practical distributions, the lowest approximation distance is achieved by polynomials with vastly varying number of pieces. We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation. Experiments combining the two techniques confirm improved performance over existing methodologies.
△ Less
Submitted 17 June, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
Robust estimation algorithms don't need to know the corruption level
Authors:
Ayush Jain,
Alon Orlitsky,
Vaishakh Ravindrakumar
Abstract:
Real data are rarely pure. Hence the past half-century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, their vast majority approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This bri…
▽ More
Real data are rarely pure. Hence the past half-century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, their vast majority approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This brief note abstracts the complex and pervasive robustness problem into a simple geometric puzzle. It then applies the puzzle's solution to derive a universal meta technique that converts any robust estimation algorithm requiring a tight corruption-level upper bound to achieve its optimal accuracy into one achieving essentially the same accuracy without using any upper bounds.
△ Less
Submitted 11 February, 2022;
originally announced February 2022.
-
SURF: A Simple, Universal, Robust, Fast Distribution Learning Algorithm
Authors:
Yi Hao,
Ayush Jain,
Alon Orlitsky,
Vaishakh Ravindrakumar
Abstract:
Sample- and computationally-efficient distribution estimation is a fundamental tenet in statistics and machine learning. We present SURF, an algorithm for approximating distributions by piecewise polynomials. SURF is: simple, replacing prior complex optimization techniques by straight-forward {empirical probability} approximation of each potential polynomial piece {through simple empirical-probabi…
▽ More
Sample- and computationally-efficient distribution estimation is a fundamental tenet in statistics and machine learning. We present SURF, an algorithm for approximating distributions by piecewise polynomials. SURF is: simple, replacing prior complex optimization techniques by straight-forward {empirical probability} approximation of each potential polynomial piece {through simple empirical-probability interpolation}, and using plain divide-and-conquer to merge the pieces; universal, as well-known polynomial-approximation results imply that it accurately approximates a large class of common distributions; robust to distribution mis-specification as for any degree $d \le 8$, it estimates any distribution to an $\ell_1$ distance $< 3$ times that of the nearest degree-$d$ piecewise polynomial, improving known factor upper bounds of 3 for single polynomials and 15 for polynomials with arbitrarily many pieces; fast, using optimal sample complexity, running in near sample-linear time, and if given sorted samples it may be parallelized to run in sub-linear time. In experiments, SURF outperforms state-of-the art algorithms.
△ Less
Submitted 11 February, 2021; v1 submitted 21 February, 2020;
originally announced February 2020.
-
Private Coded Caching
Authors:
Vaishakh Ravindrakumar,
Parthasarathi Panda,
Nikhil Karamchandani,
Vinod Prabhakaran
Abstract:
Recent work by Maddah-Ali and Niesen introduced coded caching which demonstrated the benefits of joint design of storage and transmission policies in content delivery networks. They studied a setup where a server communicates with a set of users, each equipped with a local cache, over a shared error-free link and proposed an order-optimal caching and delivery scheme. In this paper, we introduce th…
▽ More
Recent work by Maddah-Ali and Niesen introduced coded caching which demonstrated the benefits of joint design of storage and transmission policies in content delivery networks. They studied a setup where a server communicates with a set of users, each equipped with a local cache, over a shared error-free link and proposed an order-optimal caching and delivery scheme. In this paper, we introduce the problem of secretive coded caching where we impose the additional constraint that a user should not be able to learn anything, from either the content stored in its cache or the server transmissions, about a file it did not request. We propose a feasible scheme for this setting and demonstrate its order-optimality with respect to information-theoretic lower bounds.
△ Less
Submitted 11 October, 2017; v1 submitted 4 May, 2016;
originally announced May 2016.