-
A simple sketching algorithm for entropy estimation
Authors:
Peter Clifford,
Ioana Ada Cosma
Abstract:
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Rényi entropy that depends on a constant alpha. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an alpha-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Rényi entropy by taking alpha sufficiently close to 1. However, practical guidelines for parameter calibration with respect to alpha are lacking. We avoid this problem by showing that the random variables used in estimating the Rényi entropy can be transformed to have a proper distributional limit as alpha approaches 1: the maximally skewed, strictly stable distribution with alpha = 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant zeta > 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with zeta = 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity.
Submitted 17 April, 2013; v1 submitted 27 August, 2009;
originally announced August 2009.
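The abstract above relies on the fact that the Rényi entropy approaches the Shannon entropy as alpha approaches 1. The following is a minimal sketch of that limit only, computed exactly from item counts; it is not the sketching algorithm or the log-mean estimator of the paper, and the function names are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Exact empirical Shannon entropy (in nats) of a list of item counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def renyi_entropy(counts, alpha):
    """Exact empirical Renyi entropy of order alpha (alpha > 0, alpha != 1)."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(counts)
    power_sum = sum((c / total) ** alpha for c in counts if c > 0)
    return math.log(power_sum) / (1.0 - alpha)


# Illustrative counts (made up for this example, not data from the paper).
counts = [5, 3, 1, 1]
gap = abs(renyi_entropy(counts, 1.0001) - shannon_entropy(counts))
```

Here `gap` shrinks as alpha moves toward 1, which is why the choice of alpha matters when the Rényi entropy is used as a proxy for the Shannon entropy.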
-
Principle of detailed balance and convergence assessment of Markov Chain Monte Carlo methods and simulated annealing
Authors:
Ioana A. Cosma,
Masoud Asgharian
Abstract:
Markov Chain Monte Carlo (MCMC) methods are employed to sample from a given distribution of interest, whenever either the distribution does not exist in closed form, or, if it does, no efficient method to simulate an independent sample from it is available. Although a wealth of diagnostic tools for convergence assessment of MCMC methods have been proposed in the last two decades, the search for a dependable and easy-to-implement tool is ongoing. We present in this article a criterion based on the principle of detailed balance which provides a qualitative assessment of the convergence of a given chain. The criterion is based on the behaviour of a one-dimensional statistic, whose asymptotic distribution under the assumption of stationarity is derived; our results apply under weak conditions and have the advantage of being completely intuitive. We implement this criterion as a stopping rule for simulated annealing in the problem of finding maximum likelihood estimators for parameters of a 20-component mixture model. We also apply it to the problem of sampling from a 10-dimensional funnel distribution via slice sampling and the Metropolis-Hastings algorithm. Furthermore, based on this convergence criterion, we define a measure of efficiency of one algorithm versus another.
Submitted 20 July, 2008;
originally announced July 2008.
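The principle of detailed balance named in the abstract above states that, for a stationary distribution pi and transition matrix P, pi_i P_ij = pi_j P_ji for every pair of states. The sketch below only illustrates that identity for a Metropolis chain on a small discrete state space with a uniform symmetric proposal; it does not implement the paper's one-dimensional statistic or stopping rule, and all names are hypothetical.

```python
def metropolis_transition_matrix(pi):
    """Transition matrix of a Metropolis chain targeting pi.

    Assumes a uniform proposal over the other states (an illustrative choice).
    """
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Propose j with probability 1/(n-1), accept with min(1, pi_j/pi_i).
                P[i][j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
        # Rejected proposals keep the chain in state i.
        P[i][i] = 1.0 - sum(P[i])
    return P


def max_detailed_balance_violation(pi, P):
    """Largest value of |pi_i P_ij - pi_j P_ji| over all pairs of states."""
    n = len(pi)
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
               for i in range(n) for j in range(n))


# Illustrative target distribution (made up for this example).
pi = [0.1, 0.2, 0.3, 0.4]
P = metropolis_transition_matrix(pi)
violation = max_detailed_balance_violation(pi, P)
```

For this construction the violation is zero up to floating-point rounding, because the Metropolis acceptance rule is designed to satisfy detailed balance.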
-
Efficient l_{alpha} Distance Approximation for High Dimensional Data Using alpha-Stable Projection
Authors:
Peter Clifford,
Ioana A. Cosma
Abstract:
In recent years, large high-dimensional data sets have become commonplace in a wide range of applications in science and commerce. Techniques for dimension reduction are of primary concern in statistical analysis. Projection methods play an important role. We investigate the use of projection algorithms that exploit properties of the alpha-stable distributions. We show that l_{alpha} distances and quasi-distances can be recovered from random projections with full statistical efficiency by L-estimation. The computational requirements of our algorithm are modest; after a once-and-for-all calculation to determine an array of length k, the algorithm runs in O(k) time for each distance, where k is the reduced dimension of the projection.
Submitted 23 January, 2008;
originally announced January 2008.
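As a rough illustration of the projection idea in the abstract above, the sketch below handles the special case alpha = 1, where the stable distribution is the standard Cauchy. It recovers the l_1 distance with the sample median of the absolute projected differences, a simpler stand-in for the L-estimator studied in the paper and not a reproduction of it. The function names, the seed, and the choice of k are assumptions made for this example.

```python
import math
import random
import statistics


def cauchy_projection_matrix(k, d, seed=0):
    """k x d matrix of independent standard Cauchy variates (alpha = 1 stable)."""
    rng = random.Random(seed)
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(d)]
            for _ in range(k)]


def sketch(matrix, x):
    """Project a d-dimensional vector x down to k dimensions."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def estimate_l1_distance(sketch_x, sketch_y):
    """Median of |difference|; the median of |standard Cauchy| equals 1,
    so this estimates the l_1 distance between the original vectors."""
    return statistics.median(abs(a - b) for a, b in zip(sketch_x, sketch_y))


# Illustrative vectors (made up for this example).
d = 50
x = [float(i % 7) for i in range(d)]
y = [float((3 * i) % 5) for i in range(d)]
R = cauchy_projection_matrix(2001, d, seed=1)
approx = estimate_l1_distance(sketch(R, x), sketch(R, y))
exact = sum(abs(a - b) for a, b in zip(x, y))
```

Each projected difference is distributed as the true l_1 distance times a standard Cauchy variate, so any scale estimator applied to the k differences recovers the distance; the median is used here only because it is the easiest one to state.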
-
A statistical analysis of probabilistic counting algorithms
Authors:
Peter Clifford,
Ioana A. Cosma
Abstract:
This paper considers the problem of cardinality estimation in data stream applications. We present a statistical analysis of probabilistic counting algorithms, focusing on two techniques that use pseudo-random variates to form low-dimensional data sketches. We apply conventional statistical methods to compare probabilistic algorithms based on storing either selected order statistics, or random projections. We derive estimators of the cardinality in both cases, and show that the maximal-term estimator is recursively computable and has exponentially decreasing error bounds. Furthermore, we show that the estimators have comparable asymptotic efficiency, and explain this result by demonstrating an unexpected connection between the two approaches.
Submitted 7 November, 2010; v1 submitted 23 January, 2008;
originally announced January 2008.
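The order-statistics approach mentioned in the abstract above can be illustrated with a generic k-minimum-values sketch: hash each item to a pseudo-random value in (0, 1), keep the k smallest distinct values, and estimate the cardinality from the k-th smallest. This is a common textbook construction and is not claimed to be the estimator derived in the paper; the hash function, the value of k, and the function names are assumptions made for this example.

```python
import hashlib
import heapq


def hash_to_unit(item):
    """Map an item to a pseudo-random value in (0, 1) using SHA-256."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2 ** 64 + 1)


def estimate_cardinality(stream, k=512):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # max-heap (via negation) holding the k smallest values
    members = set()  # values currently stored, so repeats are ignored
    for item in stream:
        u = hash_to_unit(item)
        if u in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -u)
            members.add(u)
        elif u < -heap[0]:
            # Replace the current largest stored value with the smaller one.
            evicted = -heapq.heappushpop(heap, -u)
            members.discard(evicted)
            members.add(u)
    if len(heap) < k:
        # Fewer than k distinct values seen: the stored count is exact.
        return float(len(heap))
    return (k - 1) / (-heap[0])


# Illustrative stream: 10000 distinct items, each seen three times.
stream = [i % 10000 for i in range(30000)]
estimate = estimate_cardinality(stream)
```

The sketch stores at most k values regardless of stream length, and its relative error falls roughly as one over the square root of k.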