-
Just Wing It: Near-Optimal Estimation of Missing Mass in a Markovian Sequence
Authors:
Ashwin Pananjady,
Vidya Muthukumar,
Andrew Thangaraj
Abstract:
We study the problem of estimating the stationary mass -- also called the unigram mass -- that is missing from a single trajectory of a discrete-time, ergodic Markov chain. This problem has several applications -- for example, estimating the stationary missing mass is critical for accurately smoothing probability estimates in sequence models. While the classical Good--Turing estimator from the 1950s has appealing properties for i.i.d. data, it is known to be biased in the Markovian setting, and other heuristic estimators do not come equipped with guarantees. Operating in the general setting in which the size of the state space may be much larger than the length $n$ of the trajectory, we develop a linear-runtime estimator called Windowed Good--Turing (WingIt) and show that its risk decays as $\widetilde{O}(\mathsf{T_{mix}}/n)$, where $\mathsf{T_{mix}}$ denotes the mixing time of the chain in total variation distance. Notably, this rate is independent of the size of the state space and minimax-optimal up to a logarithmic factor in $n / \mathsf{T_{mix}}$. We also present an upper bound on the variance of the missing mass random variable, which may be of independent interest. We extend our estimator to approximate the stationary mass placed on elements occurring with small frequency in the trajectory. Finally, we demonstrate the efficacy of our estimators both in simulations on canonical chains and on sequences constructed from natural language text.
Submitted 5 October, 2024; v1 submitted 8 April, 2024;
originally announced April 2024.
-
Missing Mass Estimation from Sticky Channels
Authors:
Prafulla Chandra,
Andrew Thangaraj,
Nived Rajaraman
Abstract:
Distribution estimation under error-prone or non-ideal sampling, modelled as "sticky" channels, has been studied recently, motivated by applications such as DNA computing. Missing mass, the sum of probabilities of missing letters, is an important quantity that plays a crucial role in distribution estimation, particularly in the large alphabet regime. In this work, we consider the problem of estimation of missing mass, which has been well-studied under independent and identically distributed (i.i.d.) sampling, in the case when sampling is "sticky". Precisely, we consider the scenario where each sample from an unknown distribution gets repeated a geometrically-distributed number of times. We characterise the minimax rate of Mean Squared Error (MSE) of estimating missing mass from such sticky sampling channels. An upper bound on the minimax rate is obtained by bounding the risk of a modified Good-Turing estimator. We derive a matching lower bound on the minimax rate by extending the Le Cam method.
Submitted 6 February, 2022;
originally announced February 2022.
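The sticky sampling model described in this abstract can be simulated in a few lines. The sketch below is illustrative only: the function name `sticky_samples`, the parameter `repeat_param`, and the choice of a geometric law supported on {1, 2, ...} are assumptions, not taken from the paper, and the modified Good-Turing estimator itself is not reproduced here.

```python
import random


def sticky_samples(probs, num_draws, repeat_param, rng=None):
    """Draw i.i.d. symbols from `probs` and repeat each one a geometric number of times.

    probs:        dict mapping symbol -> probability (assumed to sum to 1).
    num_draws:    number of underlying i.i.d. draws (before repetition).
    repeat_param: probability of emitting one more copy of the current symbol
                  (assumed parametrisation; repeats take values in {1, 2, ...}).
    """
    if rng is None:
        rng = random.Random()
    symbols = list(probs)
    weights = [probs[s] for s in symbols]
    output = []
    for _ in range(num_draws):
        symbol = rng.choices(symbols, weights=weights, k=1)[0]
        repeats = 1
        # Keep repeating with probability repeat_param: a geometric repeat count.
        while rng.random() < repeat_param:
            repeats += 1
        output.extend([symbol] * repeats)
    return output
```

With `repeat_param = 0` the output reduces to ordinary i.i.d. sampling, which gives a convenient baseline when comparing estimators.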
-
Missing $g$-mass: Investigating the Missing Parts of Distributions
Authors:
Prafulla Chandra,
Andrew Thangaraj
Abstract:
Estimating the underlying distribution from \textit{iid} samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $\text{Pr}(x)$ over the missing letters $x$, and the Good-Turing estimator for missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\text{Pr}(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications for special cases such as order-$\alpha$ missing mass ($g(p)=p^\alpha$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating distance from uniformity of the missing distribution and its partial estimation. Minimax estimation is studied for order-$\alpha$ missing mass for integer values of $\alpha$ and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$ and specific results are derived for order-$\alpha$ missing mass and missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to result in right tail bounds that are better than those obtained from sub-Gaussian concentration.
Submitted 27 May, 2023; v1 submitted 5 October, 2021;
originally announced October 2021.
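The definition of the missing $g$-mass given in this abstract can be made concrete with a small sketch. It computes the true (oracle) quantity from a known distribution, not an estimator; the function names `missing_g_mass`, `order_alpha`, and `shannon_term` are hypothetical and chosen here for illustration.

```python
import math


def missing_g_mass(samples, probs, g):
    """Sum g(Pr(x)) over the letters x that do not appear in `samples`.

    probs is a dict mapping letter -> probability; it is assumed known here,
    so this is the target quantity rather than an estimate of it.
    """
    seen = set(samples)
    return sum(g(p) for x, p in probs.items() if x not in seen)


def order_alpha(alpha):
    """g(p) = p**alpha, giving the order-alpha missing mass."""
    return lambda p: p ** alpha


def shannon_term(p):
    """g(p) = -p log p, giving the missing Shannon entropy (0 at p = 0)."""
    return -p * math.log(p) if p > 0 else 0.0
```

For example, with `probs = {"a": 0.5, "b": 0.25, "c": 0.25}` and the single sample `["a"]`, `missing_g_mass(["a"], probs, order_alpha(1))` is the ordinary missing mass of the two unseen letters.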
-
How good is Good-Turing for Markov samples?
Authors:
Prafulla Chandra,
Andrew Thangaraj,
Nived Rajaraman
Abstract:
The Good-Turing (GT) estimator for the missing mass (i.e., total probability of missing symbols) in $n$ samples is the number of symbols that appeared exactly once divided by $n$. For i.i.d. samples, the bias and squared-error risk of the GT estimator can be shown to fall as $1/n$ by bounding the expected error uniformly over all symbols. In this work, we study convergence of the GT estimator for missing stationary mass (i.e., total stationary probability of missing symbols) of Markov samples on an alphabet $\mathcal{X}$ with stationary distribution $[\pi_x : x \in \mathcal{X}]$ and transition probability matrix (t.p.m.) $P$. This is an important and interesting problem because GT is widely used in applications with temporal dependencies such as language models assigning probabilities to word sequences, which are modelled as Markov. We show that convergence of GT depends on convergence of $(P^{\sim x})^n$, where $P^{\sim x}$ is $P$ with the $x$-th column zeroed out. This, in turn, depends on the Perron eigenvalue $\lambda^{\sim x}$ of $P^{\sim x}$ and its relationship with $\pi_x$ uniformly over $x$. For randomly generated t.p.m.s and t.p.m.s derived from New York Times and Charles Dickens corpora, we numerically exhibit such uniform-over-$x$ relationships between $\lambda^{\sim x}$ and $\pi_x$. This supports the observed success of GT in language models and practical text data scenarios. For Markov chains with rank-2, diagonalizable t.p.m.s having spectral gap $\beta$, we show minimax rate upper and lower bounds of $1/(n\beta^5)$ and $1/(n\beta)$, respectively, for the estimation of stationary missing mass. This theoretical result extends the $1/n$ minimax rate for i.i.d. or rank-1 t.p.m.s to rank-2 Markov, and is a first such minimax rate result for missing mass of Markov samples.
Submitted 27 May, 2023; v1 submitted 3 February, 2021;
originally announced February 2021.
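The Good-Turing estimator as defined in this abstract (the number of symbols seen exactly once, divided by $n$) is short enough to write out. The following is a minimal sketch of that definition; the function names and the helper `true_missing_mass`, which requires a known distribution, are assumptions added for illustration.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing estimate: (number of symbols appearing exactly once) / n."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def true_missing_mass(samples, probs):
    """Total probability of symbols in `probs` that never appear in `samples`."""
    seen = set(samples)
    return sum(p for x, p in probs.items() if x not in seen)
```

On the sequence `list("abracadabra")`, the symbols `c` and `d` are the only singletons among 11 samples, so the estimate is 2/11. Note that the estimator only uses symbol counts, so it ignores any temporal dependence in the sequence.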
-
Convergence of Chao Unseen Species Estimator
Authors:
Nived Rajaraman,
Prafulla Chandra,
Andrew Thangaraj,
Ananda Theertha Suresh
Abstract:
Support size estimation and the related problem of unseen species estimation have wide applications in ecology and database analysis. Perhaps the most used support size estimator is the Chao estimator. Despite its widespread use, little is known about its theoretical properties. We analyze the Chao estimator and show that its worst-case mean squared error (MSE) is smaller than the MSE of the plug-in estimator by a factor of $\mathcal{O} ((k/n)^4)$, where $k$ is the maximum support size and $n$ is the number of samples. Our main technical contribution is a new method to analyze rational estimators for discrete distribution properties, which may be of independent interest.
Submitted 13 January, 2020;
originally announced January 2020.
-
Approximation of Capacity for ISI Channels with One-bit Output Quantization
Authors:
Radha Krishna Ganti,
Andrew Thangaraj,
Arijit Mondal
Abstract:
Motivated by recent high bandwidth communication systems, Inter-Symbol Interference (ISI) channels with 1-bit quantized output are considered under an average-power-constrained continuous input. While the exact capacity is difficult to characterize, an approximation that matches with the exact channel output up to a probability of error is provided. The approximation does not have additive noise, but constrains the channel output (without noise) to be above a threshold in absolute value. The capacity under the approximation is computed using methods involving standard Gibbs distributions. Markovian achievable schemes approaching the approximate capacity are provided. The methods used over the approximate ISI channel result in ideas for practical coding schemes for ISI channels with 1-bit output quantization.
Submitted 4 May, 2015;
originally announced May 2015.