-
Linear Regression using Heterogeneous Data Batches
Authors:
Ayush Jain,
Rajat Sen,
Weihao Kong,
Abhimanyu Das,
Alon Orlitsky
Abstract:
In many learning applications, data are collected from multiple sources, each providing a \emph{batch} of samples that by itself is insufficient to learn its input-output relationship. A common approach assumes that the sources fall in one of several unknown subgroups, each with an unknown input distribution and input-output relationship. We consider one of this setup's most fundamental and important manifestations where the output is a noisy linear combination of the inputs, and there are $k$ subgroups, each with its own regression vector. Prior work~\cite{kong2020meta} showed that with abundant small-batches, the regression vectors can be learned with only few, $\tilde{Ω}(k^{3/2})$, batches of medium-size with $\tilde{Ω}(\sqrt k)$ samples each. However, the paper requires that the input distribution for all $k$ subgroups be isotropic Gaussian, and states that removing this assumption is an ``interesting and challenging problem''. We propose a novel gradient-based algorithm that improves on the existing results in several ways. It extends the applicability of the algorithm by: (1) allowing the subgroups' underlying input distributions to be different, unknown, and heavy-tailed; (2) recovering all subgroups followed by a significant proportion of batches even for infinite $k$; (3) removing the separation requirement between the regression vectors; (4) reducing the number of batches and allowing smaller batch sizes.
Submitted 5 September, 2023;
originally announced September 2023.
-
TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm
Authors:
Yi Hao,
Ayush Jain,
Alon Orlitsky,
Vaishakh Ravindrakumar
Abstract:
Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it can be shown that $c_{t,d}\ge 2$ for all other $t$ and $d$. Yet current computationally efficient algorithms show only $c_{t,1}\le 2.25$ and the bound rises quickly to $c_{t,d}\le 3$ for $d\ge 9$. We derive a near-linear-time and essentially sample-optimal estimator that establishes $c_{t,d}=2$ for all $(t,d)\ne(1,0)$. Additionally, for many practical distributions, the lowest approximation distance is achieved by polynomials with vastly varying number of pieces. We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation. Experiments combining the two techniques confirm improved performance over existing methodologies.
Submitted 17 June, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
Robust estimation algorithms don't need to know the corruption level
Authors:
Ayush Jain,
Alon Orlitsky,
Vaishakh Ravindrakumar
Abstract:
Real data are rarely pure. Hence the past half-century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, their vast majority approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This brief note abstracts the complex and pervasive robustness problem into a simple geometric puzzle. It then applies the puzzle's solution to derive a universal meta technique that converts any robust estimation algorithm requiring a tight corruption-level upper bound to achieve its optimal accuracy into one achieving essentially the same accuracy without using any upper bounds.
Submitted 11 February, 2022;
originally announced February 2022.
-
Linear-Sample Learning of Low-Rank Distributions
Authors:
Ayush Jain,
Alon Orlitsky
Abstract:
Many latent-variable applications, including community detection, collaborative filtering, genomic analysis, and NLP, model data as generated by low-rank matrices. Yet despite considerable research, except for very special cases, the number of samples required to efficiently recover the underlying matrices has not been known. We determine the onset of learning in several common latent-variable settings. For all of them, we show that learning $k\times k$, rank-$r$, matrices to normalized $L_{1}$ distance $ε$ requires $Ω(\frac{kr}{ε^2})$ samples, and propose an algorithm that uses ${\cal O}(\frac{kr}{ε^2}\log^2\frac rε)$ samples, a number linear in the high dimension, and nearly linear in the, typically low, rank. The algorithm improves on existing spectral techniques and runs in polynomial time. The proofs establish new results on the rapid convergence of the spectral distance between the model and observation matrices, and may be of independent interest.
Submitted 30 September, 2020;
originally announced October 2020.
-
Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Discrete Distributions
Authors:
Yi Hao,
Alon Orlitsky
Abstract:
The profile of a sample is the multiset of its symbol frequencies. We show that for samples of discrete distributions, profile entropy is a fundamental measure unifying the concepts of estimation, inference, and compression. Specifically, profile entropy a) determines the speed of estimating the distribution relative to the best natural estimator; b) characterizes the rate of inferring all symmetric properties compared with the best estimator over any label-invariant distribution collection; c) serves as the limit of profile compression, for which we derive optimal near-linear-time block and sequential algorithms. To further our understanding of profile entropy, we investigate its attributes, provide algorithms for approximating its value, and determine its magnitude for numerous structural distribution families.
Submitted 26 February, 2020;
originally announced February 2020.
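The definition of a profile given in the abstract above (the multiset of a sample's symbol frequencies) can be illustrated with a minimal Python sketch. The function name `profile` and the choice to represent the multiset as a map from multiplicity to number of symbols are assumptions made here for illustration; they are not taken from the paper.

```python
from collections import Counter

def profile(sample):
    """Return the profile of a sample: the multiset of its symbol frequencies.

    The multiset is represented as a Counter mapping each multiplicity m to
    the number of distinct symbols that appear exactly m times.
    (Hypothetical helper, written for illustration only.)
    """
    symbol_counts = Counter(sample)          # symbol -> number of appearances
    return Counter(symbol_counts.values())   # multiplicity -> number of symbols

if __name__ == "__main__":
    # Relabeling the symbols does not change the profile.
    print(profile("abracadabra"))
    print(profile("xyzxwxvxyzx"))
```

Because the profile discards symbol labels, any two samples that differ only by a relabeling of symbols map to the same profile.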
-
A General Method for Robust Learning from Batches
Authors:
Ayush Jain,
Alon Orlitsky
Abstract:
In many applications, data is collected in batches, some of which are corrupt or even adversarial. Recent work derived optimal robust algorithms for estimating discrete distributions in this setting. We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains. Building on these results, we derive the first robust agnostic computationally-efficient learning algorithms for piecewise-interval classification, and for piecewise-polynomial, monotone, log-concave, and gaussian-mixture distribution estimation.
Submitted 25 February, 2020;
originally announced February 2020.
-
SURF: A Simple, Universal, Robust, Fast Distribution Learning Algorithm
Authors:
Yi Hao,
Ayush Jain,
Alon Orlitsky,
Vaishakh Ravindrakumar
Abstract:
Sample- and computationally-efficient distribution estimation is a fundamental tenet in statistics and machine learning. We present SURF, an algorithm for approximating distributions by piecewise polynomials. SURF is: simple, replacing prior complex optimization techniques by straightforward approximation of each potential polynomial piece through simple empirical-probability interpolation, and using plain divide-and-conquer to merge the pieces; universal, as well-known polynomial-approximation results imply that it accurately approximates a large class of common distributions; robust to distribution mis-specification as for any degree $d \le 8$, it estimates any distribution to an $\ell_1$ distance $< 3$ times that of the nearest degree-$d$ piecewise polynomial, improving known factor upper bounds of 3 for single polynomials and 15 for polynomials with arbitrarily many pieces; fast, using optimal sample complexity, running in near sample-linear time, and if given sorted samples it may be parallelized to run in sub-linear time. In experiments, SURF outperforms state-of-the-art algorithms.
Submitted 11 February, 2021; v1 submitted 21 February, 2020;
originally announced February 2020.
-
Optimal Robust Learning of Discrete Distributions from Batches
Authors:
Ayush Jain,
Alon Orlitsky
Abstract:
Many applications, including natural language processing, sensor networks, collaborative filtering, and federated learning, call for estimating discrete distributions from data collected in batches, some of which may be untrustworthy, erroneous, faulty, or even adversarial.
Previous estimators for this setting ran in exponential time, and for some regimes required a suboptimal number of batches. We provide the first polynomial-time estimator that is optimal in the number of batches and achieves essentially the best possible estimation accuracy.
Submitted 24 February, 2020; v1 submitted 19 November, 2019;
originally announced November 2019.
-
Unified Sample-Optimal Property Estimation in Near-Linear Time
Authors:
Yi Hao,
Alon Orlitsky
Abstract:
We consider the fundamental learning problem of estimating properties of distributions over large domains. Using a novel piecewise-polynomial approximation technique, we derive the first unified methodology for constructing sample- and time-efficient estimators for all sufficiently smooth, symmetric and non-symmetric, additive properties. This technique yields near-linear-time computable estimators whose approximation values are asymptotically optimal and highly-concentrated, resulting in the first: 1) estimators achieving the $\mathcal{O}(k/(\varepsilon^2\log k))$ min-max $\varepsilon$-error sample complexity for all $k$-symbol Lipschitz properties; 2) unified near-optimal differentially private estimators for a variety of properties; 3) unified estimator achieving optimal bias and near-optimal variance for five important properties; 4) near-optimal sample-complexity estimators for several important symmetric properties over both domain sizes and confidence levels. In addition, we establish a McDiarmid's inequality under Poisson sampling, which is of independent interest.
Submitted 17 March, 2020; v1 submitted 8 November, 2019;
originally announced November 2019.
-
The Broad Optimality of Profile Maximum Likelihood
Authors:
Yi Hao,
Alon Orlitsky
Abstract:
We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size $k$ and desired accuracy $\varepsilon$:
$\textbf{Distribution estimation}$ Under $\ell_1$ distance, PML yields optimal $Θ(k/(\varepsilon^2\log k))$ sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution;
$\textbf{Additive property estimation}$ For a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence;
$\boldsymbol{α}\textbf{-Rényi entropy estimation}$ For integer $α>1$, the PML plug-in estimator has optimal $k^{1-1/α}$ sample complexity; for non-integer $α>3/4$, the PML plug-in estimator has sample complexity lower than the state of the art;
$\textbf{Identity testing}$ In testing whether an unknown distribution is equal to or at least $\varepsilon$ far from a given distribution in $\ell_1$ distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of $k$.
Most of these results also hold for a near-linear-time computable variant of PML. Stronger results hold for a different and novel variant called truncated PML (TPML).
Submitted 11 July, 2019; v1 submitted 10 June, 2019;
originally announced June 2019.
-
Data Amplification: A Unified and Competitive Approach to Property Estimation
Authors:
Yi Hao,
Alon Orlitsky,
Ananda T. Suresh,
Yihong Wu
Abstract:
Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just $2n$ samples to achieve the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples. This provides off-the-shelf, distribution-independent, "amplification" of the amount of data available relative to common-practice estimators.
We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with $n$ samples is even as good as that of the empirical estimator with $n\log n$ samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.
Submitted 29 March, 2019;
originally announced April 2019.
-
Data Amplification: Instance-Optimal Property Estimation
Authors:
Yi Hao,
Alon Orlitsky
Abstract:
The best-known and most commonly used distribution-property estimation technique uses a plug-in estimator, with empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly "amplify" the effective amount of data available. For a large variety of distribution properties including four of the most popular ones and for every underlying distribution, they achieve the accuracy that the empirical-frequency plug-in estimators would attain using a logarithmic-factor more samples.
Specifically, for Shannon entropy and a very broad class of properties including $\ell_1$-distance, the new estimators use $n$ samples to achieve the accuracy attained by the empirical estimators with $n\log n$ samples. For support-size and coverage, the new estimators use $n$ samples to achieve the performance of empirical frequency with sample size $n$ times the logarithm of the property value. Significantly strengthening the traditional min-max formulation, these results hold not only for the worst distributions, but for each and every underlying distribution. Furthermore, the logarithmic amplification factors are optimal. Experiments on a wide variety of distributions show that the new estimators outperform the previous state-of-the-art estimators designed for each specific property.
Submitted 5 March, 2019; v1 submitted 4 March, 2019;
originally announced March 2019.
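The empirical-frequency plug-in estimator that the abstract above uses as its baseline can be sketched for Shannon entropy as follows. This shows only the baseline, not the amplified estimators proposed in the paper; the function name `empirical_entropy` and the use of natural logarithms (nats) are assumptions made here.

```python
import math
from collections import Counter

def empirical_entropy(sample):
    """Plug-in Shannon entropy estimate, in nats.

    The unknown distribution is replaced by the empirical frequencies of the
    sample, and the entropy formula is evaluated on those frequencies.
    (Baseline sketch only; not the estimator proposed in the paper.)
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    # Sum -q * log(q) over the empirical frequencies q = count / n.
    return -sum((c / n) * math.log(c / n) for c in counts.values())

if __name__ == "__main__":
    print(empirical_entropy("abcd"))   # four equally frequent symbols
    print(empirical_entropy("aaab"))   # a skewed sample
```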
-
On Learning Markov Chains
Authors:
Yi Hao,
Alon Orlitsky,
Venkatadheeraj Pichapati
Abstract:
The problem of estimating an unknown discrete distribution from its samples is a fundamental tenet of statistical learning. Over the past decade, it attracted significant research effort and has been solved for a variety of divergence measures. Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider two problems related to the min-max risk (expected loss) of estimating an unknown $k$-state Markov chain from its $n$ sequential samples: predicting the conditional distribution of the next sample with respect to the KL-divergence, and estimating the transition matrix with respect to a natural loss induced by KL or a more general $f$-divergence measure.
For the first measure, we determine the min-max prediction risk to within a linear factor in the alphabet size, showing it is $Ω(k\log\log n\ / n)$ and $\mathcal{O}(k^2\log\log n\ / n)$. For the second, if the transition probabilities can be arbitrarily small, then only trivial uniform risk upper bounds can be derived. We therefore consider transition probabilities that are bounded away from zero, and resolve the problem for essentially all sufficiently smooth $f$-divergences, including KL-, $L_2$-, Chi-squared, Hellinger, and Alpha-divergences.
Submitted 27 October, 2018;
originally announced October 2018.
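To make the transition-matrix estimation task in the abstract above concrete, here is a minimal sketch of a standard add-constant (smoothed empirical) estimator computed from one sequence of states. It is a common baseline and is not claimed to be the estimator analyzed in the paper; the function name `estimate_transitions`, the integer state encoding `0..k-1`, and the default smoothing constant `beta=0.5` are assumptions made here.

```python
def estimate_transitions(sequence, k, beta=0.5):
    """Add-beta estimate of a k-state transition matrix from one sample path.

    sequence: list of states, each an integer in range(k).
    beta: smoothing constant, assumed > 0 so every row is well defined even
          when a state is never visited.
    (Baseline sketch; not the estimator from the paper.)
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # counts[i][j] = number of observed transitions i -> j.
    counts = [[0] * k for _ in range(k)]
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    estimate = []
    for row in counts:
        total = sum(row) + beta * k
        estimate.append([(c + beta) / total for c in row])
    return estimate

if __name__ == "__main__":
    for row in estimate_transitions([0, 1, 0, 1, 0], k=2):
        print(row)
```

Smoothing keeps every estimated transition probability strictly positive, which matters for KL-type losses that are infinite when an observed transition is assigned probability zero.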
-
Maximum Selection and Ranking under Noisy Comparisons
Authors:
Moein Falahatgar,
Alon Orlitsky,
Venkatadheeraj Pichapati,
Ananda Theertha Suresh
Abstract:
We consider $(ε,δ)$-PAC maximum-selection and ranking for general probabilistic models whose comparison probabilities satisfy strong stochastic transitivity and the stochastic triangle inequality. Modifying the popular knockout tournament, we propose a maximum-selection algorithm that uses $\mathcal{O}\left(\frac{n}{ε^2}\log \frac{1}δ\right)$ comparisons, a number tight up to a constant factor. We then derive a general framework that improves the performance of many ranking algorithms, and combine it with merge sort and binary search to obtain a ranking algorithm that uses $\mathcal{O}\left(\frac{n\log n (\log \log n)^3}{ε^2}\right)$ comparisons for any $δ\ge\frac1n$, a number optimal up to a $(\log \log n)^3$ factor.
Submitted 15 May, 2017;
originally announced May 2017.
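The plain knockout tournament that the abstract above takes as its starting point can be sketched as follows. This is the unmodified single-elimination scheme, not the paper's algorithm, and it carries no PAC guarantee on its own. The names `noisy_compare` and `knockout_max`, and the simple noise model in which the larger item wins with a fixed probability `p_correct`, are assumptions made here for illustration.

```python
import random

def noisy_compare(a, b, p_correct=0.9, rng=random):
    """Return the larger of a and b with probability p_correct, else the smaller.

    (Hypothetical noise model, far simpler than the general models in the paper.)
    """
    winner, loser = (a, b) if a >= b else (b, a)
    return winner if rng.random() < p_correct else loser

def knockout_max(items, compare):
    """Plain knockout tournament: pair up survivors, keep winners, repeat."""
    survivors = list(items)
    if not survivors:
        raise ValueError("items must be non-empty")
    while len(survivors) > 1:
        next_round = []
        # Compare adjacent pairs; each pair costs one comparison.
        for i in range(0, len(survivors) - 1, 2):
            next_round.append(compare(survivors[i], survivors[i + 1]))
        # With an odd number of survivors, the last one advances unopposed.
        if len(survivors) % 2 == 1:
            next_round.append(survivors[-1])
        survivors = next_round
    return survivors[0]

if __name__ == "__main__":
    rng = random.Random(0)
    items = list(range(16))
    print(knockout_max(items, lambda a, b: noisy_compare(a, b, 0.9, rng)))
```

A plain knockout uses exactly $n-1$ comparisons for $n$ items, so a single unlucky comparison can eliminate the true maximum, which is why the tournament has to be modified to obtain a guarantee.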
-
A Unified Maximum Likelihood Approach for Optimal Distribution Property Estimation
Authors:
Jayadev Acharya,
Hirakendu Das,
Alon Orlitsky,
Ananda Theertha Suresh
Abstract:
The advent of data science has spurred interest in estimating properties of distributions over large alphabets. Fundamental symmetric properties such as support size, support coverage, entropy, and proximity to uniformity, received most attention, with each property estimated using a different technique and often intricate analysis tools.
We prove that for all these properties, a single, simple, plug-in estimator---profile maximum likelihood (PML)---performs as well as the best specialized techniques. This raises the possibility that PML may optimally estimate many other symmetric properties.
Submitted 28 November, 2016; v1 submitted 9 November, 2016;
originally announced November 2016.
-
Maximum Selection and Sorting with Adversarial Comparators and an Application to Density Estimation
Authors:
Jayadev Acharya,
Moein Falahatgar,
Ashkan Jafarpour,
Alon Orlitsky,
Ananda Theertha Suresh
Abstract:
We study maximum selection and sorting of $n$ numbers using pairwise comparators that output the larger of their two inputs if the inputs are more than a given threshold apart, and output an adversarially-chosen input otherwise. We consider two adversarial models: a non-adaptive adversary that decides on the outcomes in advance based solely on the inputs, and an adaptive adversary that can decide on the outcome of each query depending on previous queries and outcomes.
Against the non-adaptive adversary, we derive a maximum-selection algorithm that uses at most $2n$ comparisons in expectation, and a sorting algorithm that uses at most $2n \ln n$ comparisons in expectation. These numbers are within small constant factors from the best possible. Against the adaptive adversary, we propose a maximum-selection algorithm that uses $Θ(n\log (1/ε))$ comparisons to output a correct answer with probability at least $1-ε$. The existence of this algorithm affirmatively resolves an open problem of Ajtai, Feldman, Hassidim, and Nelson.
Our study was motivated by a density-estimation problem where, given samples from an unknown underlying distribution, we would like to find a distribution in a known class of $n$ candidate distributions that is close to the underlying distribution in $\ell_1$ distance. Scheffe's algorithm outputs a distribution at an $\ell_1$ distance at most 9 times the minimum and runs in time $Θ(n^2\log n)$. Using maximum selection, we propose an algorithm with the same approximation guarantee but a run time of $Θ(n\log n)$.
Submitted 8 June, 2016;
originally announced June 2016.
-
Universal Compression of Power-Law Distributions
Authors:
Moein Falahatgar,
Ashkan Jafarpour,
Alon Orlitsky,
Venkatadheeraj Pichapati,
Ananda Theertha Suresh
Abstract:
English words and the outputs of many other natural processes are well-known to follow a Zipf distribution. Yet this thoroughly-established property has never been shown to help compress or predict these important processes. We show that the expected redundancy of Zipf distributions of order $α>1$ is roughly the $1/α$ power of the expected redundancy of unrestricted distributions. Hence for these orders, Zipf distributions can be better compressed and predicted than was previously known. Unlike the expected case, we show that worst-case redundancy is roughly the same for Zipf and for unrestricted distributions. Hence Zipf distributions have significantly different worst-case and expected redundancies, making them the first natural distribution class shown to have such a difference.
Submitted 30 April, 2015; v1 submitted 29 April, 2015;
originally announced April 2015.
-
Faster Algorithms for Testing under Conditional Sampling
Authors:
Moein Falahatgar,
Ashkan Jafarpour,
Alon Orlitsky,
Venkatadheeraj Pichapathi,
Ananda Theertha Suresh
Abstract:
There has been considerable recent interest in distribution-tests whose run-time and sample requirements are sublinear in the domain-size $k$. We study two of the most important tests under the conditional-sampling model where each query specifies a subset $S$ of the domain, and the response is a sample drawn from $S$ according to the underlying distribution.
For identity testing, which asks whether the underlying distribution equals a specific given distribution or $ε$-differs from it, we reduce the known time and sample complexities from $\tilde{\mathcal{O}}(ε^{-4})$ to $\tilde{\mathcal{O}}(ε^{-2})$, thereby matching the information-theoretic lower bound. For closeness testing, which asks whether two distributions underlying observed data sets are equal or different, we reduce existing complexity from $\tilde{\mathcal{O}}(ε^{-4} \log^5 k)$ to an even sub-logarithmic $\tilde{\mathcal{O}}(ε^{-5} \log \log k)$, thus providing a better bound to an open problem in the Bertinoro Workshop on Sublinear Algorithms [Fisher, 2004].
Submitted 16 April, 2015;
originally announced April 2015.
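The conditional-sampling model described in the abstract above (a query names a subset $S$ of the domain and the response is a sample from $S$ drawn according to the underlying distribution) can be sketched as a small oracle. The function name `conditional_sample`, the dictionary representation of the distribution, and the choice to raise an error when $S$ has zero probability mass are assumptions made here; the paper may adopt a different convention for that last case.

```python
import random

def conditional_sample(p, subset, rng=random):
    """Answer one conditional-sampling query.

    p: dict mapping each domain element to its probability.
    subset: the queried set S of domain elements.
    Returns an element of S drawn with probability p[x] / p(S).
    (Hypothetical oracle; zero-mass queries raise an error by assumption.)
    """
    elements = sorted(subset)
    weights = [p[x] for x in elements]
    if sum(weights) <= 0:
        raise ValueError("queried subset has zero probability mass")
    # random.choices normalizes the weights, which conditions p on S.
    return rng.choices(elements, weights=weights, k=1)[0]

if __name__ == "__main__":
    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    rng = random.Random(0)
    draws = [conditional_sample(p, {"b", "c"}, rng) for _ in range(10)]
    print(draws)
```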
-
Competitive Distribution Estimation
Authors:
Alon Orlitsky,
Ananda Theertha Suresh
Abstract:
Estimating an unknown distribution from its samples is a fundamental problem in statistics. The common, min-max, formulation of this goal considers the performance of the best estimator over all distributions in a class. It shows that with $n$ samples, distributions over $k$ symbols can be learned to a KL divergence that decreases to zero with the sample size $n$, but grows unboundedly with the alphabet size $k$.
Min-max performance can be viewed as regret relative to an oracle that knows the underlying distribution. We consider two natural and modest limits on the oracle's power. In the first, it knows the underlying distribution only up to symbol permutations; in the second, it knows the exact distribution but is restricted to use natural estimators that assign the same probability to symbols that appeared equally many times in the sample.
We show that in both cases the competitive regret reduces to $\min(k/n,\tilde{\mathcal{O}}(1/\sqrt n))$, a quantity upper bounded uniformly for every alphabet size. This shows that distributions can be estimated nearly as well as when they are essentially known in advance, and nearly as well as when they are completely known in advance but need to be estimated via a natural estimator. We also provide an estimator that runs in linear time and incurs competitive regret of $\tilde{\mathcal{O}}(\min(k/n,1/\sqrt n))$, and show that for natural estimators this competitive regret is inevitable. We also demonstrate the effectiveness of competitive estimators using simulations.
Submitted 26 March, 2015;
originally announced March 2015.
-
Universal compression of Gaussian sources with unknown parameters
Authors:
A. Orlitsky,
N. Santhanam
Abstract:
For a collection of distributions over a countable support set, the worst case universal compression formulation by Shtarkov attempts to assign a universal distribution over the support set. The formulation aims to ensure that the universal distribution does not underestimate the probability of any element in the support set relative to distributions in the collection. When the alphabet is uncountable and we have a collection $\cal P$ of Lebesgue continuous measures instead, we ask if there is a corresponding universal probability density function (pdf) that does not underestimate the value of the density function at any point in the support relative to pdfs in $\cal P$.
Analogous to the worst case redundancy of a collection of distributions over a countable alphabet, we define the \textit{attenuation} of a class to be $A$ when the worst case optimal universal pdf at any point $x$ in the support is always at least the value any pdf in the collection $\cal P$ assigns to $x$ divided by $A$. We analyze the attenuation of the worst optimal universal pdf over length-$n$ samples generated \textit{i.i.d.} from a Gaussian distribution whose mean can be anywhere between $-α/2$ and $α/2$ and variance between $σ_m^2$ and $σ_M^2$. We show that this attenuation is finite, grows with the number of samples as ${\cal O}(n)$, and also specify the attenuation exactly without approximations. When only one parameter is allowed to vary, we show that the attenuation grows as ${\cal O}(\sqrt{n})$, again keeping in line with results from prior literature that fix the order of magnitude as a factor of $\sqrt{n}$ per parameter. In addition, we also specify the attenuation exactly without approximation when only the mean or only the variance is allowed to vary.
Submitted 16 October, 2014;
originally announced October 2014.
-
Estimating Renyi Entropy of Discrete Distributions
Authors:
Jayadev Acharya,
Alon Orlitsky,
Ananda Theertha Suresh,
Himanshu Tyagi
Abstract:
It was recently shown that estimating the Shannon entropy $H({\rm p})$ of a discrete $k$-symbol distribution ${\rm p}$ requires $Θ(k/\log k)$ samples, a number that grows near-linearly in the support size. In many applications $H({\rm p})$ can be replaced by the more general Rényi entropy of order $α$, $H_α({\rm p})$. We determine the number of samples needed to estimate $H_α({\rm p})$ for all $α$, showing that $α< 1$ requires a super-linear, roughly $k^{1/α}$ samples, noninteger $α>1$ requires a near-linear $k$ samples, but, perhaps surprisingly, integer $α>1$ requires only $Θ(k^{1-1/α})$ samples. Furthermore, developing on a recently established connection between polynomial approximation and estimation of additive functions of the form $\sum_{x} f({\rm p}_x)$, we reduce the sample complexity for noninteger values of $α$ by a factor of $\log k$ compared to the empirical estimator. The estimators achieving these bounds are simple and run in time linear in the number of samples. Our lower bounds provide explicit constructions of distributions with different Rényi entropies that are hard to distinguish.
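For reference, the empirical (plug-in) estimator that the abstract uses as a baseline can be sketched as follows. This is a minimal illustration, not the paper's improved estimator: it evaluates $H_α({\rm p}) = \frac{1}{1-α}\log\sum_x {\rm p}_x^α$ on the empirical frequencies, in nats. The function names `renyi_entropy` and `empirical_renyi_entropy` are hypothetical.

```python
import math
from collections import Counter


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (in nats) of a probability vector p.

    Falls back to Shannon entropy at alpha == 1, the limiting case.
    """
    if alpha == 1:
        return -sum(q * math.log(q) for q in p if q > 0)
    power_sum = sum(q ** alpha for q in p if q > 0)
    return math.log(power_sum) / (1 - alpha)


def empirical_renyi_entropy(sample, alpha):
    """Plug-in estimate: apply renyi_entropy to the empirical frequencies."""
    n = len(sample)
    counts = Counter(sample)
    return renyi_entropy([c / n for c in counts.values()], alpha)


# A uniform distribution over 4 symbols has Renyi entropy log(4) for every order.
h2_uniform = renyi_entropy([0.25] * 4, 2)
# Plug-in estimate from a tiny sample with two equally frequent symbols.
h2_sample = empirical_renyi_entropy("aabb", 2)
```

The plug-in estimate is biased for small samples; the abstract's bounds concern how many samples are needed before estimators of this kind become accurate.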
Submitted 10 March, 2016; v1 submitted 2 August, 2014;
originally announced August 2014.
-
Universal Compression of Envelope Classes: Tight Characterization via Poisson Sampling
Authors:
Jayadev Acharya,
Ashkan Jafarpour,
Alon Orlitsky,
Ananda Theertha Suresh
Abstract:
The Poisson-sampling technique eliminates dependencies among symbol appearances in a random sequence. It has been used to simplify the analysis and strengthen the performance guarantees of randomized algorithms. Applying this method to universal compression, we relate the redundancies of fixed-length and Poisson-sampled sequences, and use the relation to derive a simple single-letter formula that approximates the redundancy of any envelope class to within an additive logarithmic term. As a first application, we consider i.i.d. distributions over a small alphabet as a step-envelope class, and provide a short proof that determines the redundancy of discrete distributions over a small alphabet up to the first order terms. We then show the strength of our method by applying the formula to tighten the existing bounds on the redundancy of exponential and power-law classes, in particular answering a question posed by Boucheron, Garivier and Gassiat.
Submitted 29 May, 2014;
originally announced May 2014.
-
String Reconstruction from Substring Compositions
Authors:
Jayadev Acharya,
Hirakendu Das,
Olgica Milenkovic,
Alon Orlitsky,
Shengjun Pan
Abstract:
Motivated by mass-spectrometry protein sequencing, we consider a simply-stated problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime, can be reconstructed uniquely up to reversal. For all other lengths we show that reconstruction is not always possible and provide sometimes-tight bounds on the largest number of strings with given substring compositions. The lower bounds are derived by combinatorial arguments and the upper bounds by algebraic considerations that precisely characterize the set of strings with the same substring compositions in terms of the factorization of bivariate polynomials. The problem can be viewed as a combinatorial simplification of the turnpike problem, and its solution may shed light on this long-standing problem as well. Using well known results on transience of multi-dimensional random walks, we also provide a reconstruction algorithm that reconstructs random strings over alphabets of size $\ge4$ in optimal near-quadratic time.
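To make the input to the reconstruction problem concrete, the sketch below computes the multiset of substring compositions of a string, where the composition of a substring is taken to be its symbols with order discarded. It is a brute-force illustration only, not the paper's reconstruction algorithm, and the function name `composition_multiset` is hypothetical.

```python
from collections import Counter


def composition_multiset(s):
    """Multiset of compositions of all contiguous substrings of s.

    Each composition is stored as a sorted tuple of symbols, so two
    substrings with the same symbol counts map to the same key.
    """
    compositions = Counter()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            compositions[tuple(sorted(s[i:j]))] += 1
    return compositions


# A string and its reversal always share the same composition multiset,
# which is why reconstruction is only possible up to reversal.
forward = composition_multiset("abca")
backward = composition_multiset("acba")
```

A string of length $n$ has $n(n+1)/2$ contiguous substrings, so the multiset for `"abca"` contains 10 compositions counted with multiplicity.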
Submitted 10 March, 2014;
originally announced March 2014.
-
Near-optimal-sample estimators for spherical Gaussian mixtures
Authors:
Jayadev Acharya,
Ashkan Jafarpour,
Alon Orlitsky,
Ananda Theertha Suresh
Abstract:
Statistical and machine-learning algorithms are frequently applied to high-dimensional data. In many of these applications data is scarce, and often much more costly than computation time. We provide the first sample-efficient polynomial-time estimator for high-dimensional spherical Gaussian mixtures.
For mixtures of any $k$ $d$-dimensional spherical Gaussians, we derive an intuitive spectral-estimator that uses $\mathcal{O}_k\bigl(\frac{d\log^2d}{ε^4}\bigr)$ samples and runs in time $\mathcal{O}_{k,ε}(d^3\log^5 d)$, both significantly lower than previously known. The constant factor $\mathcal{O}_k$ is polynomial for sample complexity and is exponential for the time complexity, again much smaller than what was previously known. We also show that $Ω_k\bigl(\frac{d}{ε^2}\bigr)$ samples are needed for any algorithm. Hence the sample complexity is near-optimal in the number of dimensions.
We also derive a simple estimator for one-dimensional mixtures that uses $\mathcal{O}\bigl(\frac{k \log \frac{k}ε }{ε^2} \bigr)$ samples and runs in time $\widetilde{\mathcal{O}}\left(\bigl(\frac{k}ε\bigr)^{3k+1}\right)$. Our other technical contributions include a faster algorithm for choosing a density estimate from a set of distributions, that minimizes the $\ell_1$ distance to an unknown underlying distribution.
Submitted 19 February, 2014;
originally announced February 2014.
-
On Modeling Profiles instead of Values
Authors:
Alon Orlitsky,
Narayana Santhanam,
Krishnamurthy Viswanathan,
Junan Zhang
Abstract:
We consider the problem of estimating the distribution underlying an observed sample of data. Instead of maximum likelihood, which maximizes the probability of the observed values, we propose a different estimate, the high-profile distribution, which maximizes the probability of the observed profile: the number of symbols appearing any given number of times. We determine the high-profile distribution of several data samples, establish some of its general properties, and show that when the number of distinct symbols observed is small compared to the data size, the high-profile and maximum-likelihood distributions are roughly the same, but when the number of symbols is large, the distributions differ, and high-profile better explains the data.
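The profile defined in the abstract can be computed directly from a sample. The sketch below is only an illustration of that definition (it does not compute the high-profile distribution itself), and the function name `profile` is hypothetical.

```python
from collections import Counter


def profile(sample):
    """Map each multiplicity to the number of distinct symbols seen that many times."""
    multiplicities = Counter(sample)  # symbol -> number of appearances
    return dict(Counter(multiplicities.values()))  # appearances -> number of symbols


# In "abracadabra": a appears 5 times, b and r twice each, c and d once each.
original = profile("abracadabra")

# Relabeling the symbols leaves the profile unchanged, since the profile
# ignores which symbol is which and keeps only the multiplicities.
relabeled = profile("abracadabra".translate(str.maketrans("abrcd", "vwxyz")))
```

Here `original` is `{5: 1, 2: 2, 1: 2}`: one symbol seen five times, two symbols seen twice, and two symbols seen once.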
Submitted 11 July, 2012;
originally announced July 2012.