-
Inference for Dependent Data with Learned Clusters
Authors:
Jianfei Cao,
Christian Hansen,
Damian Kozbur,
Lucciano Villacorta
Abstract:
This paper presents and analyzes an approach to cluster-based inference for dependent data. The primary setting considered here is with spatially indexed data in which the dependence structure of observed random variables is characterized by a known, observed dissimilarity measure over spatial indices. Observations are partitioned into clusters with the use of an unsupervised clustering algorithm applied to the dissimilarity measure. Once the partition into clusters is learned, a cluster-based inference procedure is applied to a statistical hypothesis testing procedure. The procedure proposed in the paper allows the number of clusters to depend on the data, which gives researchers a principled method for choosing an appropriate clustering level. The paper gives conditions under which the proposed procedure asymptotically attains correct size. A simulation study shows that the proposed procedure attains near nominal size in finite samples in a variety of statistical testing problems with dependent data.
Submitted 14 November, 2022; v1 submitted 30 July, 2021;
originally announced July 2021.
-
Dimension-Free Anticoncentration Bounds for Gaussian Order Statistics with Discussion of Applications to Multiple Testing
Authors:
Damian Kozbur
Abstract:
The following anticoncentration property is proved. The probability that the $k$-order statistic of an arbitrarily correlated jointly Gaussian random vector $X$ with unit variance components lies within an interval of length $\varepsilon$ is bounded above by $2\varepsilon k\,(1+\mathrm{E}[\|X\|_\infty])$. This bound has implications for generalized error rate control in statistical high-dimensional multiple hypothesis testing problems, which are discussed subsequently.
Submitted 22 July, 2021;
originally announced July 2021.
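For reference, the bound stated in this abstract can be written as a display. This is only a restatement under assumed notation: $X_{(k)}$ for the $k$-order statistic and $[t, t+\varepsilon]$ for a generic interval of length $\varepsilon$ are labels introduced here, and the abstract does not say whether the order statistic is counted from the largest or the smallest component.

```latex
% Assumed notation: X_{(k)} is the k-order statistic of X,
% and [t, t + \varepsilon] is any interval of length \varepsilon.
\[
  \sup_{t \in \mathbb{R}}
  \mathrm{P}\bigl( X_{(k)} \in [t,\, t + \varepsilon] \bigr)
  \;\le\;
  2\,\varepsilon\, k \,\bigl( 1 + \mathrm{E}[\|X\|_\infty] \bigr).
\]
```

The right-hand side depends on the dimension of $X$ only through $\mathrm{E}[\|X\|_\infty]$, which is the sense in which the title calls the bound dimension-free.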
-
Targeted Undersmoothing
Authors:
Christian Hansen,
Damian Kozbur,
Sanjog Misra
Abstract:
This paper proposes a post-model selection inference procedure, called targeted undersmoothing, designed to construct uniformly valid confidence sets for a broad class of functionals of sparse high-dimensional statistical models. These include dense functionals, which may potentially depend on all elements of an unknown high-dimensional parameter. The proposed confidence sets are based on an initially selected model and two additionally selected models, an upper model and a lower model, which enlarge the initially selected model. We illustrate application of the procedure in two empirical examples. The first example considers estimation of heterogeneous treatment effects using data from the Job Training Partnership Act of 1982, and the second example looks at estimating profitability from a mailing strategy based on estimated heterogeneous treatment effects in a direct mail marketing campaign. We also provide evidence on the finite sample performance of the proposed targeted undersmoothing procedure through a series of simulation experiments.
Submitted 7 June, 2018; v1 submitted 22 June, 2017;
originally announced June 2017.
-
Analysis of Testing-Based Forward Model Selection
Authors:
Damian Kozbur
Abstract:
This paper introduces and analyzes a procedure called Testing-based forward model selection (TBFMS) in linear regression problems. This procedure inductively selects covariates that add predictive power into a working statistical model before estimating a final regression. The criterion for deciding which covariate to include next and when to stop including covariates is derived from a profile of traditional statistical hypothesis tests. This paper proves probabilistic bounds, which depend on the quality of the tests, for prediction error and the number of selected covariates. As an example, the bounds are then specialized to a case with heteroskedastic data, with tests constructed with the help of Huber-Eicker-White standard errors. Under the assumed regularity conditions, these tests lead to estimation convergence rates matching other common high-dimensional estimators including Lasso.
Submitted 6 April, 2020; v1 submitted 8 December, 2015;
originally announced December 2015.
-
Inference in Additively Separable Models With a High-Dimensional Set of Conditioning Variables
Authors:
Damian Kozbur
Abstract:
This paper studies nonparametric series estimation and inference for the effect of a single variable of interest x on an outcome y in the presence of potentially high-dimensional conditioning variables z. The context is an additively separable model E[y|x, z] = g0(x) + h0(z). The model is high-dimensional in the sense that the series of approximating functions for h0(z) can have more terms than the sample size, thereby allowing z to have potentially very many measured characteristics. The model is required to be approximately sparse: h0(z) can be approximated using only a small subset of series terms whose identities are unknown. This paper proposes an estimation and inference method for g0(x) called Post-Nonparametric Double Selection which is a generalization of Post-Double Selection. Standard rates of convergence and asymptotic normality for the estimator are shown to hold uniformly over a large class of sparse data generating processes. A simulation study illustrates finite sample estimation properties of the proposed estimator and coverage properties of the corresponding confidence intervals. Finally, an empirical application to college admissions policy demonstrates the practical implementation of the proposed method.
Submitted 6 April, 2020; v1 submitted 18 March, 2015;
originally announced March 2015.