Search | arXiv e-print repository

arXiv:2002.01044

Optimal Confidence Regions for the Multinomial Parameter

Authors: Matthew L. Malloy, Ardhendu Tripathy, Robert D. Nowak

Abstract: Construction of tight confidence regions and intervals is central to statistical inference and decision making. This paper develops new theory showing minimum average volume confidence regions for categorical data. More precisely, consider an empirical distribution $\widehat{\boldsymbol{p}}$ generated from $n$ iid realizations of a random variable that takes one of $k$ possible values according to an unknown distribution $\boldsymbol{p}$. This is analogous to a single draw from a multinomial distribution. A confidence region is a subset of the probability simplex that depends on $\widehat{\boldsymbol{p}}$ and contains the unknown $\boldsymbol{p}$ with a specified confidence. This paper shows how one can construct minimum average volume confidence regions, answering a long standing question. We also show the optimality of the regions directly translates to optimal confidence intervals of linear functionals such as the mean, implying sample complexity and regret improvements for adaptive machine learning algorithms.

Submitted 29 January, 2021; v1 submitted 3 February, 2020; originally announced February 2020.

arXiv:1506.04257

Contamination Estimation via Convex Relaxations

Authors: Matthew L. Malloy, Scott Alfeld, Paul Barford

Abstract: Identifying anomalies and contamination in datasets is important in a wide variety of settings. In this paper, we describe a new technique for estimating contamination in large, discrete valued datasets. Our approach considers the normal condition of the data to be specified by a model consisting of a set of distributions. Our key contribution is in our approach to contamination estimation. Specifically, we develop a technique that identifies the minimum number of data points that must be discarded (i.e., the level of contamination) from an empirical data set in order to match the model to within a specified goodness-of-fit, controlled by a p-value. Appealing to results from large deviations theory, we show a lower bound on the level of contamination is obtained by solving a series of convex programs. Theoretical results guarantee the bound converges at a rate of $O(\sqrt{\log(p)/p})$, where p is the size of the empirical data set.

Submitted 13 June, 2015; originally announced June 2015.

Comments: To appear, ISIT 2015

arXiv:1105.4540

On the Limits of Sequential Testing in High Dimensions

Authors: Matthew Malloy, Robert Nowak

Abstract: This paper presents results pertaining to sequential methods for support recovery of sparse signals in noise. Specifically, we show that any sequential measurement procedure fails provided the average number of measurements per dimension grows slower than log s / D(f0||f1), where s is the level of sparsity and D(f0||f1) is the Kullback-Leibler divergence between the underlying distributions. For comparison, we show any non-sequential procedure fails provided the number of measurements grows at a rate less than log n / D(f1||f0), where n is the total dimension of the problem. Lastly, we show that a simple procedure termed sequential thresholding guarantees exact support recovery provided the average number of measurements per dimension grows faster than (log s + log log n) / D(f0||f1), a mere additive factor more than the lower bound.

Submitted 18 October, 2011; v1 submitted 23 May, 2011; originally announced May 2011.

Comments: Asilomar 2011

arXiv:1103.5991

Sequential Analysis in High Dimensional Multiple Testing and Sparse Recovery

Authors: Matthew Malloy, Robert Nowak

Abstract: This paper studies the problem of high-dimensional multiple testing and sparse recovery from the perspective of sequential analysis. In this setting, the probability of error is a function of the dimension of the problem. A simple sequential testing procedure is proposed. We derive necessary conditions for reliable recovery in the non-sequential setting and contrast them with sufficient conditions for reliable recovery using the proposed sequential testing procedure. Applications of the main results to several commonly encountered models show that sequential testing can be exponentially more sensitive to the difference between the null and alternative distributions (in terms of the dependence on dimension), implying that subtle cases can be much more reliably determined using sequential methods.

Submitted 3 June, 2011; v1 submitted 30 March, 2011; originally announced March 2011.

Comments: Submitted to ISIT 2011

Showing 1–4 of 4 results for author: Malloy, M