-
Disentangled Feature Importance
Authors:
Jin-Hong Du,
Kathryn Roeder,
Larry Wasserman
Abstract:
Feature importance quantification faces a fundamental challenge: when predictors are correlated, standard methods systematically underestimate their contributions. We prove that major existing approaches target identical population functionals under squared-error loss, revealing why they share this correlation-induced bias.
To address this limitation, we introduce Disentangled Feature Importance (DFI), a nonparametric generalization of the classical $R^2$ decomposition via optimal transport. DFI transforms correlated features into independent latent variables using a transport map, eliminating correlation distortion. Importance is computed in this disentangled space and attributed back through the transport map's sensitivity. DFI provides a principled decomposition of importance scores that sum to the total predictive variability for latent additive models and to interaction-weighted functional ANOVA variances more generally, under arbitrary feature dependencies.
We develop a comprehensive semiparametric theory for DFI. For general transport maps, we establish root-$n$ consistency and asymptotic normality of importance estimators in the latent space, which extends to the original feature space for the Bures-Wasserstein map. Notably, our estimators achieve second-order estimation error, which vanishes if both regression function and transport map estimation errors are $o_{\mathbb{P}}(n^{-1/4})$. By design, DFI avoids the computational burden of repeated submodel refitting and the challenges of conditional covariate distribution estimation, thereby achieving computational efficiency.
Submitted 30 June, 2025;
originally announced July 2025.
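As a rough illustration of the Bures-Wasserstein map mentioned in the abstract above, the sketch below uses the standard closed form for the optimal transport map between two centered Gaussians, $T(x) = Ax$ with $A = \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}$. Taking $\Sigma_2 = I$ turns correlated features into uncorrelated latent coordinates. The function names are hypothetical and chosen for illustration only; this is not the DFI estimator from the paper.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def bures_wasserstein_map(S1, S2):
    """Linear map A such that x -> A x pushes N(0, S1) onto N(0, S2).

    Assumes S1 is positive definite (hypothetical helper, not from the paper).
    """
    R = sqrtm_psd(S1)
    R_inv = np.linalg.inv(R)
    return R_inv @ sqrtm_psd(R @ S2 @ R) @ R_inv

# Example: decorrelate two correlated features by mapping onto the identity covariance.
S1 = np.array([[2.0, 0.8], [0.8, 1.0]])
A = bures_wasserstein_map(S1, np.eye(2))
latent_cov = A @ S1 @ A.T  # covariance of the transformed features
```

With the identity as target, the map reduces to $\Sigma_1^{-1/2}$, i.e. symmetric whitening.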
-
Statistical Inference for Optimal Transport Maps: Recent Advances and Perspectives
Authors:
Sivaraman Balakrishnan,
Tudor Manole,
Larry Wasserman
Abstract:
In many applications of optimal transport (OT), the object of primary interest is the optimal transport map. This map rearranges mass from one probability distribution to another in the most efficient way possible by minimizing a specified cost. In this paper we review recent advances in estimating and developing limit theorems for the OT map, using samples from the underlying distributions. We also review parallel lines of work that establish similar results for special cases and variants of the basic OT setup. We conclude with a discussion of key directions for future research with the goal of providing practitioners with reliable inferential tools.
Submitted 23 June, 2025;
originally announced June 2025.
-
Testing Random Effects for Binomial Data
Authors:
Lucas Kania,
Larry Wasserman,
Sivaraman Balakrishnan
Abstract:
In modern scientific research, small-scale studies with limited participants are increasingly common. However, interpreting individual outcomes can be challenging, making it standard practice to combine data across studies using random effects to draw broader scientific conclusions. In this work, we introduce an optimal methodology for assessing the goodness of fit between a given reference distribution and the distribution of random effects arising from binomial counts.
Using the minimax framework, we characterize the smallest separation between the null and alternative hypotheses, called the critical separation, under the 1-Wasserstein distance that ensures the existence of a valid and powerful test. The optimal test combines a plug-in estimator of the Wasserstein distance with a debiased version of Pearson's chi-squared test.
We focus on meta-analyses, where a key question is whether multiple studies agree on a treatment's effectiveness before pooling data. That is, researchers must determine whether treatment effects are homogeneous across studies. We begin by analyzing scenarios with a specified reference effect, such as testing whether all studies show the treatment is effective 80% of the time, and describe how the critical separation depends on the reference effect. We then extend the analysis to homogeneity testing without a reference effect and construct an optimal test by debiasing Cochran's chi-squared test.
Finally, we illustrate how our proposed methodologies improve the construction of p-values and confidence intervals, with applications to assessing drug safety in the context of rare adverse outcomes and modeling political outcomes at the county level.
Submitted 17 April, 2025;
originally announced April 2025.
-
Stochastic interventions, sensitivity analysis, and optimal transport
Authors:
Alexander W. Levis,
Edward H. Kennedy,
Alec McClean,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
Recent methodological research in causal inference has focused on effects of stochastic interventions, which assign treatment randomly, often according to subject-specific covariates. In this work, we demonstrate that the usual notion of stochastic interventions has a surprising property: when there is unmeasured confounding, bounds on their effects do not collapse when the policy approaches the observational regime. As an alternative, we propose to study generalized policies, treatment rules that can depend on covariates, the natural value of treatment, and auxiliary randomness. We show that certain generalized policy formulations can resolve the "non-collapsing" bound issue: bounds narrow to a point when the target treatment distribution approaches that in the observed data. Moreover, drawing connections to the theory of optimal transport, we characterize generalized policies that minimize worst-case bound width in various sensitivity analysis models, as well as corresponding sharp bounds on their causal effects. These optimal policies are new, and can have a more parsimonious interpretation compared to their usual stochastic policy analogues. Finally, we develop flexible, efficient, and robust estimators for the sharp nonparametric bounds that emerge from the framework.
Submitted 21 November, 2024;
originally announced November 2024.
-
Double Cross-fit Doubly Robust Estimators: Beyond Series Regression
Authors:
Alec McClean,
Sivaraman Balakrishnan,
Edward H. Kennedy,
Larry Wasserman
Abstract:
Doubly robust estimators with cross-fitting have gained popularity in causal inference due to their favorable structure-agnostic error guarantees. However, when additional structure, such as Hölder smoothness, is available then more accurate "double cross-fit doubly robust" (DCDR) estimators can be constructed by splitting the training data and undersmoothing nuisance function estimators on independent samples. We study a DCDR estimator of the Expected Conditional Covariance, a functional of interest in causal inference and conditional independence testing. We first provide a structure-agnostic error analysis for the DCDR estimator with no assumptions on the nuisance functions or their estimators. Then, assuming the nuisance functions are Hölder smooth, but without assuming knowledge of the true smoothness level or the covariate density, we establish that DCDR estimators with several linear smoothers are $\sqrt{n}$-consistent and asymptotically normal under minimal conditions and achieve fast convergence rates in the non-$\sqrt{n}$ regime. When the covariate density and smoothnesses are known, we propose a minimax rate-optimal DCDR estimator based on undersmoothed kernel regression. Moreover, we show an undersmoothed DCDR estimator satisfies a slower-than-$\sqrt{n}$ central limit theorem, and that inference is possible even in the non-$\sqrt{n}$ regime. Finally, we support our theoretical results with simulations, providing intuition for double cross-fitting and undersmoothing, demonstrating where our estimator achieves $\sqrt{n}$-consistency while the usual "single cross-fit" estimator fails, and illustrating asymptotic normality for the undersmoothed DCDR estimator.
Submitted 7 May, 2025; v1 submitted 22 March, 2024;
originally announced March 2024.
-
Semi-Supervised U-statistics
Authors:
Ilmun Kim,
Larry Wasserman,
Sivaraman Balakrishnan,
Matey Neykov
Abstract:
Semi-supervised datasets are ubiquitous across diverse domains where obtaining fully labeled data is costly or time-consuming. The prevalence of such datasets has consistently driven the demand for new tools and methods that exploit the potential of unlabeled data. Responding to this demand, we introduce semi-supervised U-statistics enhanced by the abundance of unlabeled data, and investigate their statistical properties. We show that the proposed approach is asymptotically Normal and exhibits notable efficiency gains over classical U-statistics by effectively integrating various powerful prediction tools into the framework. To understand the fundamental difficulty of the problem, we derive minimax lower bounds in semi-supervised settings and showcase that our procedure is semi-parametrically efficient under regularity conditions. Moreover, tailored to bivariate kernels, we propose a refined approach that outperforms the classical U-statistic across all degeneracy regimes, and demonstrate its optimality properties. Simulation studies are conducted to corroborate our findings and to further demonstrate our framework.
Submitted 9 March, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
-
Central Limit Theorems for Smooth Optimal Transport Maps
Authors:
Tudor Manole,
Sivaraman Balakrishnan,
Jonathan Niles-Weed,
Larry Wasserman
Abstract:
One of the central objects in the theory of optimal transport is the Brenier map: the unique monotone transformation which pushes forward an absolutely continuous probability law onto any other given law. A line of recent work has analyzed $L^2$ convergence rates of plugin estimators of Brenier maps, which are defined as the Brenier map between density estimators of the underlying distributions. In this work, we show that such estimators satisfy a pointwise central limit theorem when the underlying laws are supported on the flat torus of dimension $d \geq 3$. We also derive a negative result, showing that these estimators do not converge weakly in $L^2$ when the dimension is sufficiently large. Our proofs hinge upon a quantitative linearization of the Monge-Ampère equation, which may be of independent interest. This result allows us to reduce our problem to that of deriving limit laws for the solution of a uniformly elliptic partial differential equation with a stochastic right-hand side, subject to periodic boundary conditions.
Submitted 16 September, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Conservative Inference for Counterfactuals
Authors:
Sivaraman Balakrishnan,
Edward Kennedy,
Larry Wasserman
Abstract:
In causal inference, the joint law of a set of counterfactual random variables is generally not identified. We show that a conservative version of the joint law - corresponding to the smallest treatment effect - is identified. Finding this law uses recent results from optimal transport theory. Under this conservative law we can bound causal effects and we may construct inferences for each individual's counterfactual dose-response curve. Intuitively, this is the flattest counterfactual curve for each subject that is consistent with the distribution of the observables. If the outcome is univariate then, under mild conditions, this curve is simply the quantile function of the counterfactual distribution that passes through the observed point. This curve corresponds to a nonparametric rank preserving structural model.
Submitted 19 October, 2023;
originally announced October 2023.
-
Causal Effect Estimation after Propensity Score Trimming with Continuous Treatments
Authors:
Zach Branson,
Edward H. Kennedy,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
Propensity score trimming, which discards subjects with propensity scores below a threshold, is a common way to address positivity violations that complicate causal effect estimation. However, most works on trimming assume treatment is discrete and models for the outcome regression and propensity score are parametric. This work proposes nonparametric estimators for trimmed average causal effects in the case of continuous treatments based on efficient influence functions. For continuous treatments, an efficient influence function for a trimmed causal effect does not exist, due to a lack of pathwise differentiability induced by trimming and a continuous treatment. Thus, we target a smoothed version of the trimmed causal effect for which an efficient influence function exists. Our resulting estimators exhibit doubly-robust style guarantees, with error involving products or squares of errors for the outcome regression and propensity score, which allows for valid inference even when nonparametric models are used. Our results allow the trimming threshold to be fixed or defined as a quantile of the propensity score, such that confidence intervals incorporate uncertainty involved in threshold estimation. These findings are validated via simulation and an application, thereby showing how to efficiently-but-flexibly estimate trimmed causal effects with continuous treatments.
Submitted 29 July, 2024; v1 submitted 1 September, 2023;
originally announced September 2023.
-
Nearly Minimax Optimal Wasserstein Conditional Independence Testing
Authors:
Matey Neykov,
Larry Wasserman,
Ilmun Kim,
Sivaraman Balakrishnan
Abstract:
This paper is concerned with minimax conditional independence testing. In contrast to some previous works on the topic, which use the total variation distance to separate the null from the alternative, here we use the Wasserstein distance. In addition, we impose Wasserstein smoothness conditions which on bounded domains are weaker than the corresponding total variation smoothness imposed, for instance, by Neykov et al. [2021]. This added flexibility expands the distributions which are allowed under the null and the alternative to include distributions which may contain point masses for instance. We characterize the optimal rate of the critical radius of testing up to logarithmic factors. Our test statistic which nearly achieves the optimal critical radius is novel, and can be thought of as a weighted multi-resolution version of the U-statistic studied by Neykov et al. [2021].
Submitted 16 August, 2023;
originally announced August 2023.
-
Conditional Independence Testing for Discrete Distributions: Beyond $χ^2$- and $G$-tests
Authors:
Ilmun Kim,
Matey Neykov,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
This paper is concerned with the problem of conditional independence testing for discrete data. In recent years, researchers have shed new light on this fundamental problem, emphasizing finite-sample optimality. The non-asymptotic viewpoint adopted in these works has led to novel conditional independence tests that enjoy certain optimality under various regimes. Despite their attractive theoretical properties, the considered tests are not necessarily practical, relying on a Poissonization trick and unspecified constants in their critical values. In this work, we attempt to bridge the gap between theory and practice by reproving optimality without Poissonization and calibrating tests using Monte Carlo permutations. Along the way, we also prove that classical asymptotic $χ^2$- and $G$-tests are notably sub-optimal in a high-dimensional regime, which justifies the demand for new tools. Our theoretical results are complemented by experiments on both simulated and real-world datasets. Accompanying this paper is an R package UCI that implements the proposed tests.
Submitted 28 October, 2023; v1 submitted 10 August, 2023;
originally announced August 2023.
-
The Fundamental Limits of Structure-Agnostic Functional Estimation
Authors:
Sivaraman Balakrishnan,
Edward H. Kennedy,
Larry Wasserman
Abstract:
Many recent developments in causal inference, and functional estimation problems more generally, have been motivated by the fact that classical one-step (first-order) debiasing methods, or their more recent sample-split double machine-learning avatars, can outperform plugin estimators under surprisingly weak conditions. These first-order corrections improve on plugin estimators in a black-box fashion, and consequently are often used in conjunction with powerful off-the-shelf estimation methods. These first-order methods are however provably suboptimal in a minimax sense for functional estimation when the nuisance functions live in Holder-type function spaces. This suboptimality of first-order debiasing has motivated the development of "higher-order" debiasing methods. The resulting estimators are, in some cases, provably optimal over Holder-type spaces, but both the estimators which are minimax-optimal and their analyses are crucially tied to properties of the underlying function space.
In this paper we investigate the fundamental limits of structure-agnostic functional estimation, where relatively weak conditions are placed on the underlying nuisance functions. We show that there is a strong sense in which existing first-order methods are optimal. We achieve this goal by providing a formalization of the problem of functional estimation with black-box nuisance function estimates, and deriving minimax lower bounds for this problem. Our results highlight some clear tradeoffs in functional estimation -- if we wish to remain agnostic to the underlying nuisance function spaces, impose only high-level rate conditions, and maintain compatibility with black-box nuisance estimators then first-order methods are optimal. When we have an understanding of the structure of the underlying nuisance functions then carefully constructed higher-order estimators can outperform first-order estimators.
Submitted 7 June, 2025; v1 submitted 6 May, 2023;
originally announced May 2023.
-
Sensitivity Analysis for Marginal Structural Models
Authors:
Matteo Bonvini,
Edward Kennedy,
Valerie Ventura,
Larry Wasserman
Abstract:
We introduce several methods for assessing sensitivity to unmeasured confounding in marginal structural models; importantly we allow treatments to be discrete or continuous, static or time-varying. We consider three sensitivity models: a propensity-based model, an outcome-based model, and a subset confounding model, in which only a fraction of the population is subject to unmeasured confounding. In each case we develop efficient estimators and confidence intervals for bounds on the causal parameters.
Submitted 11 October, 2022; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Median Regularity and Honest Inference
Authors:
Arun Kumar Kuchibhotla,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
We introduce a new notion of regularity of an estimator called median regularity. We prove that uniformly valid (honest) inference for a functional is possible if and only if there exists a median regular estimator of that functional. To our knowledge, such a notion of regularity that is necessary for uniformly valid inference is unavailable in the literature.
Submitted 6 June, 2022;
originally announced June 2022.
-
Minimax rates for heterogeneous causal effect estimation
Authors:
Edward H. Kennedy,
Sivaraman Balakrishnan,
James M. Robins,
Larry Wasserman
Abstract:
Estimation of heterogeneous causal effects - i.e., how effects of policies and treatments vary across subjects - is a fundamental task in causal inference. Many methods for estimating conditional average treatment effects (CATEs) have been proposed in recent years, but questions surrounding optimality have remained largely unanswered. In particular, a minimax theory of optimality has yet to be developed, with the minimax rate of convergence and construction of rate-optimal estimators remaining open problems. In this paper we derive the minimax rate for CATE estimation, in a Holder-smooth nonparametric model, and present a new local polynomial estimator, giving high-level conditions under which it is minimax optimal. Our minimax lower bound is derived via a localized version of the method of fuzzy hypotheses, combining lower bound constructions for nonparametric regression and functional estimation. Our proposed estimator can be viewed as a local polynomial R-Learner, based on a localized modification of higher-order influence function methods. The minimax rate we find exhibits several interesting features, including a non-standard elbow phenomenon and an unusual interpolation between nonparametric regression and functional estimation rates. The latter quantifies how the CATE, as an estimand, can be viewed as a regression/functional hybrid.
Submitted 22 December, 2023; v1 submitted 1 March, 2022;
originally announced March 2022.
-
Local permutation tests for conditional independence
Authors:
Ilmun Kim,
Matey Neykov,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
In this paper, we investigate local permutation tests for testing conditional independence between two random vectors $X$ and $Y$ given $Z$. The local permutation test determines the significance of a test statistic by locally shuffling samples which share similar values of the conditioning variables $Z$, and it forms a natural extension of the usual permutation approach for unconditional independence testing. Despite its simplicity and empirical support, the theoretical underpinnings of the local permutation test remain unclear. Motivated by this gap, this paper aims to establish theoretical foundations of local permutation tests with a particular focus on binning-based statistics. We start by revisiting the hardness of conditional independence testing and provide an upper bound for the power of any valid conditional independence test, which holds when the probability of observing collisions in $Z$ is small. This negative result naturally motivates us to impose additional restrictions on the possible distributions under the null and alternate. To this end, we focus our attention on certain classes of smooth distributions and identify provably tight conditions under which the local permutation method is universally valid, i.e. it is valid when applied to any (binning-based) test statistic. To complement this result on type I error control, we also show that in some cases, a binning-based statistic calibrated via the local permutation method can achieve minimax optimal power. We also introduce a double-binning permutation strategy, which yields a valid test over less smooth null distributions than the typical single-binning method without compromising much power. Finally, we present simulation results to support our theoretical findings.
Submitted 6 January, 2022; v1 submitted 21 December, 2021;
originally announced December 2021.
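The local permutation idea described in the abstract above can be sketched as follows: bin the conditioning variable $Z$, shuffle $Y$ only among samples that fall in the same bin, and compare a binning-based statistic against its permuted values. This is a minimal sketch for scalar $X$, $Y$, $Z$; the quantile binning, the within-bin cross-product statistic, and the function names are assumptions made for illustration, not the statistics or tuning choices analyzed in the paper.

```python
import numpy as np

def binned_statistic(x, y, bins):
    """Absolute sum of within-bin centered cross-products of x and y."""
    total = 0.0
    for b in np.unique(bins):
        idx = bins == b
        total += np.sum((x[idx] - x[idx].mean()) * (y[idx] - y[idx].mean()))
    return abs(total)

def local_permutation_pvalue(x, y, z, n_bins=10, n_perm=200, seed=0):
    """Permutation p-value where y is shuffled only within bins of z."""
    rng = np.random.default_rng(seed)
    # Quantile bins on z (an illustrative choice of binning).
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(z, edges)
    observed = binned_statistic(x, y, bins)
    exceed = 0
    for _ in range(n_perm):
        y_perm = y.copy()
        for b in np.unique(bins):
            idx = np.flatnonzero(bins == b)
            y_perm[idx] = y[rng.permutation(idx)]  # shuffle within the bin only
        exceed += binned_statistic(x, y_perm, bins) >= observed
    # Add-one correction keeps the p-value strictly positive.
    return (1 + exceed) / (1 + n_perm)
```

Shuffling within bins approximately preserves the dependence of $Y$ on $Z$ while breaking any remaining dependence between $X$ and $Y$, which is what the conditional null requires.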
-
Data fission: splitting a single data point
Authors:
James Leiner,
Boyan Duan,
Larry Wasserman,
Aaditya Ramdas
Abstract:
Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offer an alternative approach that uses additive Gaussian noise -- this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
Submitted 10 December, 2023; v1 submitted 21 December, 2021;
originally announced December 2021.
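For Gaussian data with known variance, the additive-noise split mentioned in the abstract above can be illustrated with a small sketch. Assuming $X \sim N(\mu, \sigma^2)$ with $\sigma$ known, draw $Z \sim N(0, \sigma^2)$ independently and set $f(X) = X + \tau Z$ and $g(X) = X - Z/\tau$. Then $\mathrm{Cov}(f, g) = \sigma^2 - \sigma^2 = 0$, so the two parts are independent, and $X = (f + \tau^2 g)/(1 + \tau^2)$ recovers the original observation. The function name and the parameter `tau` are illustrative assumptions, not an API from the paper.

```python
import numpy as np

def gaussian_fission(x, sigma, tau=1.0, seed=0):
    """Split Gaussian observations x (known noise level sigma) into two parts.

    Returns (f, g) with f = x + tau * z and g = x - z / tau, where z is
    independent N(0, sigma^2) noise. Hypothetical helper for illustration.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma, size=np.shape(x))
    return x + tau * z, x - z / tau

# Example: fission of simulated Gaussian data, then exact reconstruction.
rng = np.random.default_rng(2)
x = rng.normal(3.0, 2.0, size=20000)
tau = 1.0
f, g = gaussian_fission(x, sigma=2.0, tau=tau)
x_rebuilt = (f + tau**2 * g) / (1 + tau**2)
```

Larger `tau` puts more noise into `f` and leaves more information in `g`, which is the sense in which this acts as a continuous analog of choosing a split fraction.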
-
Universal Inference Meets Random Projections: A Scalable Test for Log-concavity
Authors:
Robin Dunn,
Aditya Gangrade,
Larry Wasserman,
Aaditya Ramdas
Abstract:
Shape constraints yield flexible middle grounds between fully nonparametric and fully parametric approaches to modeling distributions of data. The specific assumption of log-concavity is motivated by applications across economics, survival modeling, and reliability theory. However, there do not currently exist valid tests for whether the underlying density of given data is log-concave. The recent universal inference methodology provides a valid test. The universal test relies on maximum likelihood estimation (MLE), and efficient methods already exist for finding the log-concave MLE. This yields the first test of log-concavity that is provably valid in finite samples in any dimension, for which we also establish asymptotic consistency results. Empirically, we find that a random projections approach that converts the d-dimensional testing problem into many one-dimensional problems can yield high power, leading to a simple procedure that is statistically and computationally efficient.
Submitted 14 April, 2024; v1 submitted 17 November, 2021;
originally announced November 2021.
-
Plugin Estimation of Smooth Optimal Transport Maps
Authors:
Tudor Manole,
Sivaraman Balakrishnan,
Jonathan Niles-Weed,
Larry Wasserman
Abstract:
We analyze a number of natural estimators for the optimal transport map between two distributions and show that they are minimax optimal. We adopt the plugin approach: our estimators are simply optimal couplings between measures derived from our observations, appropriately extended so that they define functions on $\mathbb{R}^d$. When the underlying map is assumed to be Lipschitz, we show that computing the optimal coupling between the empirical measures, and extending it using linear smoothers, already gives a minimax optimal estimator. When the underlying map enjoys higher regularity, we show that the optimal coupling between appropriate nonparametric density estimates yields faster rates. Our work also provides new bounds on the risk of corresponding plugin estimators for the quadratic Wasserstein distance, and we show how this problem relates to that of estimating optimal transport maps using stability arguments for smooth and strongly convex Brenier potentials. As an application of our results, we derive central limit theorems for plugin estimators of the squared Wasserstein distance, which are centered at their population counterpart when the underlying distributions have sufficiently smooth densities. In contrast to known central limit theorems for empirical estimators, this result easily lends itself to statistical inference for the quadratic Wasserstein distance.
Submitted 16 June, 2024; v1 submitted 26 July, 2021;
originally announced July 2021.
-
The HulC: Confidence Regions from Convex Hulls
Authors:
Arun Kumar Kuchibhotla,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
We develop and analyze the HulC, an intuitive and general method for constructing confidence sets using the convex hull of estimates constructed from subsets of the data. Unlike classical methods which are based on estimating the (limiting) distribution of an estimator, the HulC is often simpler to use and effectively bypasses this step. In comparison to the bootstrap, the HulC requires fewer regularity conditions and succeeds in many examples where the bootstrap provably fails. Unlike subsampling, the HulC does not require knowledge of the rate of convergence of the estimators on which it is based. The validity of the HulC requires knowledge of the (asymptotic) median-bias of the estimators. We further analyze a variant of our basic method, called the Adaptive HulC, which is fully data-driven and estimates the median-bias using subsampling. We show that the Adaptive HulC retains the aforementioned strengths of the HulC. In certain cases where the underlying estimators are pathologically asymmetric the HulC and Adaptive HulC can fail to provide useful confidence sets. We propose a final variant, the Unimodal HulC, which can salvage the situation in cases where the distribution of the underlying estimator is (asymptotically) unimodal. We discuss these methods in the context of several challenging inferential problems which arise in parametric, semi-parametric, and non-parametric inference. Although our focus is on validity under weak regularity conditions, we also provide some general results on the width of the HulC confidence sets, showing that in many cases the HulC confidence sets have near-optimal width.
Submitted 8 September, 2023; v1 submitted 30 May, 2021;
originally announced May 2021.
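The convex-hull idea in the HulC abstract above can be sketched for a scalar parameter. This is a simplified illustration rather than the paper's procedure: it assumes the estimator has zero median bias, that the data are in exchangeable (already shuffled) order, and it omits any randomization over the number of batches. The name `hulc_interval` is hypothetical. Under these assumptions, the chance that all $B$ batch estimates fall on the same side of the target is $2 \cdot 2^{-B}$, so $B = \lceil \log_2(2/\alpha) \rceil$ batches suffice.

```python
import math
import statistics


def hulc_interval(data, alpha=0.05, estimator=statistics.mean):
    """Confidence interval from the convex hull (min, max) of batch estimates.

    Assumes a median-unbiased estimator and data in exchangeable order.
    """
    n_batches = math.ceil(math.log2(2.0 / alpha))
    if len(data) < n_batches:
        raise ValueError("need at least one observation per batch")
    # Interleaved split: batch i takes observations i, i + B, i + 2B, ...
    estimates = [estimator(data[i::n_batches]) for i in range(n_batches)]
    return min(estimates), max(estimates)


interval = hulc_interval(list(range(12)), alpha=0.05)
```

No variance estimate or limiting distribution is used here; the interval width comes entirely from the spread of the batch estimates.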
-
Semiparametric counterfactual density estimation
Authors:
Edward H. Kennedy,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
Causal effects are often characterized with averages, which can give an incomplete picture of the underlying counterfactual distributions. Here we consider estimating the entire counterfactual density and generic functionals thereof. We focus on two kinds of target parameters. The first is a density approximation, defined by a projection onto a finite-dimensional model using a generalized distance metric, which includes f-divergences as well as $L_p$ norms. The second is the distance between counterfactual densities, which can be used as a more nuanced effect measure than the mean difference, and as a tool for model selection. We study nonparametric efficiency bounds for these targets, giving results for smooth but otherwise generic models and distances. Importantly, we show how these bounds connect to means of particular non-trivial functions of counterfactuals, linking the problems of density and mean estimation. We go on to propose doubly robust-style estimators for the density approximations and distances, and study their rates of convergence, showing they can be optimally efficient in large nonparametric models. We also give analogous methods for model selection and aggregation, when many models may be available and of interest. Our results all hold for generic models and distances, but throughout we highlight what happens for particular choices, such as $L_2$ projections on linear models, and KL projections on exponential families. Finally we illustrate by estimating the density of CD4 count among patients with HIV, had all been treated with combination therapy versus zidovudine alone, as well as a density effect. Our results suggest combination therapy may have increased CD4 count most for high-risk patients. Our methods are implemented in the freely available R package npcausal on GitHub.
Submitted 23 February, 2021;
originally announced February 2021.
-
Berry-Esseen Bounds for Projection Parameters and Partial Correlations with Increasing Dimension
Authors:
Arun Kumar Kuchibhotla,
Alessandro Rinaldo,
Larry Wasserman
Abstract:
We provide finite sample bounds on the Normal approximation to the law of the least squares estimator of the projection parameters normalized by the sandwich-based standard errors. Our results hold in the increasing dimension setting and under minimal assumptions on the data generating distribution. In particular, we do not assume a linear regression function and only require the existence of finitely many moments for the response and the covariates. Furthermore, we construct confidence sets for the projection parameters in the form of hyper-rectangles and establish finite sample bounds on their coverage and accuracy. We derive analogous results for partial correlations among the entries of sub-Gaussian vectors.
Submitted 22 October, 2021; v1 submitted 19 July, 2020;
originally announced July 2020.
-
The huge Package for High-dimensional Undirected Graph Estimation in R
Authors:
Tuo Zhao,
Han Liu,
Kathryn Roeder,
John Lafferty,
Larry Wasserman
Abstract:
We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up large-scale problems, making a tradeoff between computational and statistical efficiency.
Submitted 25 June, 2020;
originally announced June 2020.
-
Minimax optimality of permutation tests
Authors:
Ilmun Kim,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
Permutation tests are widely used in statistics, providing a finite-sample guarantee on the type I error rate whenever the distribution of the samples under the null hypothesis is invariant to some rearrangement. Despite its increasing popularity and empirical success, theoretical properties of the permutation test, especially its power, have not been fully explored beyond simple cases. In this paper, we attempt to partly fill this gap by presenting a general non-asymptotic framework for analyzing the minimax power of the permutation test. The utility of our proposed framework is illustrated in the context of two-sample and independence testing under both discrete and continuous settings. In each setting, we introduce permutation tests based on U-statistics and study their minimax performance. We also develop exponential concentration bounds for permuted U-statistics based on a novel coupling idea, which may be of independent interest. Building on these exponential bounds, we introduce permutation tests which are adaptive to unknown smoothness parameters without losing much power. The proposed framework is further illustrated using more sophisticated test statistics including weighted U-statistics for multinomial testing and Gaussian kernel-based statistics for density testing. Finally, we provide some simulation results that further justify the permutation approach.
Submitted 25 May, 2022; v1 submitted 30 March, 2020;
originally announced March 2020.
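The finite-sample type I error guarantee described in the permutation-test abstract above comes from comparing the observed statistic with its values under random relabelings. A minimal two-sample sketch follows; it uses the absolute difference in means as a simple stand-in statistic (the paper studies U-statistics), and the function name, the number of permutations and the seed are illustrative assumptions.

```python
import random


def perm_test_pvalue(x, y, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for a two-sample test.

    Statistic: absolute difference in sample means (a stand-in choice).
    """
    rng = random.Random(seed)

    def stat(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(x, y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled sample
        if stat(pooled[:len(x)], pooled[len(x):]) >= observed:
            count += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return (1 + count) / (1 + n_perm)


p_value = perm_test_pvalue(list(range(10, 20)), list(range(10)))
```

Rejecting when the p-value is at most $\alpha$ controls the type I error at level $\alpha$ whenever the two samples are exchangeable under the null.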
-
Minimax Optimal Conditional Independence Testing
Authors:
Matey Neykov,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
We consider the problem of conditional independence testing of $X$ and $Y$ given $Z$ where $X,Y$ and $Z$ are three real random variables and $Z$ is continuous. We focus on two main cases - when $X$ and $Y$ are both discrete, and when $X$ and $Y$ are both continuous. In view of recent results on conditional independence testing (Shah and Peters, 2018), one cannot hope to design non-trivial tests, which control the type I error for all absolutely continuous conditionally independent distributions, while still ensuring power against interesting alternatives. Consequently, we identify various, natural smoothness assumptions on the conditional distributions of $X,Y|Z=z$ as $z$ varies in the support of $Z$, and study the hardness of conditional independence testing under these smoothness assumptions. We derive matching lower and upper bounds on the critical radius of separation between the null and alternative hypotheses in the total variation metric. The tests we consider are easily implementable and rely on binning the support of the continuous variable $Z$. To complement these results, we provide a new proof of the hardness result of Shah and Peters.
Submitted 1 July, 2021; v1 submitted 9 January, 2020;
originally announced January 2020.
-
Universal Inference
Authors:
Larry Wasserman,
Aaditya Ramdas,
Sivaraman Balakrishnan
Abstract:
We propose a general method for constructing hypothesis tests and confidence sets that have finite sample guarantees without regularity conditions. We refer to such procedures as "universal." The method is very simple and is based on a modified version of the usual likelihood ratio statistic, that we call "the split likelihood ratio test" (split LRT). The method is especially appealing for irregular statistical models. Canonical examples include mixture models and models that arise in shape-constrained inference. Constructing tests and confidence sets for such models is notoriously difficult. Typical inference methods, like the likelihood ratio test, are not useful in these cases because they have intractable limiting distributions. In contrast, the method we suggest works for any parametric model and also for some nonparametric models. The split LRT can also be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid $p$-values and confidence sequences.
Submitted 19 October, 2022; v1 submitted 24 December, 2019;
originally announced December 2019.
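The split likelihood ratio test named in the abstract above can be sketched in a toy setting. This is a minimal illustration under stated assumptions, not the paper's general procedure: the model is $N(\theta, 1)$, the null $H_0: \theta = \theta_0$ is simple, and the data are assumed to be in random order so that a first-half/second-half split acts as a random split. The function names are hypothetical. One half is used to fit $\hat\theta_1$; the likelihood ratio of $\hat\theta_1$ against the null is evaluated on the other half, and the test rejects when the ratio exceeds $1/\alpha$.

```python
import math


def normal_loglik(xs, theta):
    """Log-likelihood of xs under N(theta, 1)."""
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi) for x in xs)


def split_lrt_rejects(xs, theta0, alpha=0.05):
    """Split LRT for H0: theta = theta0 in the N(theta, 1) model."""
    half = len(xs) // 2
    d1, d0 = xs[:half], xs[half:]
    theta1_hat = sum(d1) / len(d1)  # any estimator fitted on d1 only
    # Split likelihood ratio, evaluated on the held-out half d0.
    log_ratio = normal_loglik(d0, theta1_hat) - normal_loglik(d0, theta0)
    return log_ratio > math.log(1.0 / alpha)


reject_far = split_lrt_rejects([3.0] * 20, theta0=0.0)
reject_null = split_lrt_rejects([0.0] * 20, theta0=0.0)
```

For a composite null, the denominator would instead use the maximum likelihood estimate under the null computed on the held-out half.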
-
Minimax Confidence Intervals for the Sliced Wasserstein Distance
Authors:
Tudor Manole,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
Motivated by the growing popularity of variants of the Wasserstein distance in statistics and machine learning, we study statistical inference for the Sliced Wasserstein distance--an easily computable variant of the Wasserstein distance. Specifically, we construct confidence intervals for the Sliced Wasserstein distance which have finite-sample validity under no assumptions or under mild moment assumptions. These intervals are adaptive in length to the regularity of the underlying distributions. We also bound the minimax risk of estimating the Sliced Wasserstein distance, and as a consequence establish that the lengths of our proposed confidence intervals are minimax optimal over appropriate distribution classes. To motivate the choice of these classes, we also study minimax rates of estimating a distribution under the Sliced Wasserstein distance. These theoretical findings are complemented with a simulation study demonstrating the deficiencies of the classical bootstrap, and the advantages of our proposed methods. We also show strong correspondences between our theoretical predictions and the adaptivity of our confidence interval lengths in simulations. We conclude by demonstrating the use of our confidence intervals in the setting of simulator-based likelihood-free inference. In this setting, contrasting popular approximate Bayesian computation methods, we develop uncertainty quantification methods with rigorous frequentist coverage guarantees.
Submitted 3 April, 2022; v1 submitted 17 September, 2019;
originally announced September 2019.
-
Homotopy Reconstruction via the Cech Complex and the Vietoris-Rips Complex
Authors:
Jisu Kim,
Jaehyeok Shin,
Frédéric Chazal,
Alessandro Rinaldo,
Larry Wasserman
Abstract:
We derive conditions under which the reconstruction of a target space is topologically correct via the Čech complex or the Vietoris-Rips complex obtained from possibly noisy point cloud data. We provide two novel theoretical results. First, we describe sufficient conditions under which any non-empty intersection of finitely many Euclidean balls intersected with a positive reach set is contractible, so that the Nerve theorem applies for the restricted Čech complex. Second, we demonstrate the homotopy equivalence of a positive $μ$-reach set and its offsets. Applying these results to the restricted Čech complex and using the interleaving relations with the Čech complex (or the Vietoris-Rips complex), we formulate conditions guaranteeing that the target space is homotopy equivalent to the Čech complex (or the Vietoris-Rips complex), in terms of the $μ$-reach. Our results sharpen existing results.
Submitted 12 May, 2020; v1 submitted 16 March, 2019;
originally announced March 2019.
-
Uniform Convergence Rate of the Kernel Density Estimator Adaptive to Intrinsic Volume Dimension
Authors:
Jisu Kim,
Jaehyeok Shin,
Alessandro Rinaldo,
Larry Wasserman
Abstract:
We derive concentration inequalities for the supremum norm of the difference between a kernel density estimator (KDE) and its point-wise expectation that hold uniformly over the selection of the bandwidth and under weaker conditions on the kernel and the data generating distribution than previously used in the literature. We first propose a novel concept, called the volume dimension, to measure the intrinsic dimension of the support of a probability distribution based on the rates of decay of the probability of vanishing Euclidean balls. Our bounds depend on the volume dimension and generalize the existing bounds derived in the literature. In particular, when the data-generating distribution has a bounded Lebesgue density or is supported on a sufficiently well-behaved lower-dimensional manifold, our bound recovers the same convergence rate depending on the intrinsic dimension of the support as ones known in the literature. At the same time, our results apply to more general cases, such as the ones of distribution with unbounded densities or supported on a mixture of manifolds with different dimensions. Analogous bounds are derived for the derivative of the KDE, of any order. Our results are generally applicable but are especially useful for problems in geometric inference and topological data analysis, including level set estimation, density-based clustering, modal clustering and mode hunting, ridge estimation and persistent homology.
Submitted 31 December, 2019; v1 submitted 13 October, 2018;
originally announced October 2018.
-
Distribution-Free Prediction Sets for Two-Layer Hierarchical Models
Authors:
Robin Dunn,
Larry Wasserman,
Aaditya Ramdas
Abstract:
We consider the problem of constructing distribution-free prediction sets for data from two-layer hierarchical distributions. For iid data, prediction sets can be constructed using the method of conformal prediction. The validity of conformal prediction hinges on the exchangeability of the data, which does not hold when groups of observations come from distinct distributions, such as multiple observations on each patient in a medical database. We extend conformal methods to this hierarchical setting. We develop CDF pooling, single subsampling, and repeated subsampling approaches to construct prediction sets in unsupervised and supervised settings. We compare these approaches in terms of coverage and average set size. If asymptotic coverage is acceptable, we recommend CDF pooling for its balance between empirical coverage and average set size. If we desire coverage guarantees, then we recommend the repeated subsampling approach.
Submitted 23 February, 2022; v1 submitted 19 September, 2018;
originally announced September 2018.
-
Robust Multivariate Nonparametric Tests via Projection-Averaging
Authors:
Ilmun Kim,
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
In this work, we generalize the Cramér-von Mises statistic via projection-averaging to obtain a robust test for the multivariate two-sample problem. The proposed test is consistent against all fixed alternatives, robust to heavy-tailed data and minimax rate optimal against a certain class of alternatives. Our test statistic is completely free of tuning parameters and is computationally efficient even in high dimensions. When the dimension tends to infinity, the proposed test is shown to have comparable power to the existing high-dimensional mean tests under certain location models. As a by-product of our approach, we introduce a new metric called the angular distance which can be thought of as a robust alternative to the Euclidean distance. Using the angular distance, we connect the proposed method to the reproducing kernel Hilbert space approach. In addition to the Cramér-von Mises statistic, we demonstrate that the projection-averaging technique can be used to define robust, multivariate tests in many other problems.
Submitted 21 May, 2019; v1 submitted 1 March, 2018;
originally announced March 2018.
-
Hypothesis Testing For Densities and High-Dimensional Multinomials: Sharp Local Minimax Rates
Authors:
Sivaraman Balakrishnan,
Larry Wasserman
Abstract:
We consider the goodness-of-fit testing problem of distinguishing whether the data are drawn from a specified distribution, versus a composite alternative separated from the null in the total variation metric. In the discrete case, we consider goodness-of-fit testing when the null distribution has a possibly growing or unbounded number of categories. In the continuous case, we consider testing a Lipschitz density, with possibly unbounded support, in the low-smoothness regime where the Lipschitz parameter is not assumed to be constant. In contrast to existing results, we show that the minimax rate and critical testing radius in these settings depend strongly, and in a precise way, on the null distribution being tested and this motivates the study of the (local) minimax rate as a function of the null distribution. For multinomials the local minimax rate was recently studied in the work of Valiant and Valiant. We re-visit and extend their results and develop two modifications to the chi-squared test whose performance we characterize. For testing Lipschitz densities, we show that the usual binning tests are inadequate in the low-smoothness regime and we design a spatially adaptive partitioning scheme that forms the basis for our locally minimax optimal tests. Furthermore, we provide the first local minimax lower bounds for this problem which yield a sharp characterization of the dependence of the critical radius on the null hypothesis being tested. In the low-smoothness regime we also provide adaptive tests, that adapt to the unknown smoothness parameter. We illustrate our results with a variety of simulations that demonstrate the practical utility of our proposed tests.
Submitted 29 June, 2017;
originally announced June 2017.
-
Estimating the Reach of a Manifold
Authors:
Eddie Aamari,
Jisu Kim,
Frédéric Chazal,
Bertrand Michel,
Alessandro Rinaldo,
Larry Wasserman
Abstract:
Various problems in manifold estimation make use of a quantity called the reach, denoted by $\tau_M$, which is a measure of the regularity of the manifold. This paper is the first investigation into the problem of how to estimate the reach. First, we study the geometry of the reach through an approximation perspective. We derive new geometric results on the reach for submanifolds without boundary. An estimator $\hat{\tau}$ of $\tau_{M}$ is proposed in a framework where tangent spaces are known, and bounds assessing its efficiency are derived. In the case of an i.i.d. random point cloud $\mathbb{X}_{n}$, $\hat{\tau}(\mathbb{X}_{n})$ is shown to achieve uniform expected loss bounds over a $\mathcal{C}^3$-like model. Finally, we obtain upper and lower bounds on the minimax rate for estimating the reach.
Submitted 8 April, 2019; v1 submitted 12 May, 2017;
originally announced May 2017.
-
Bootstrapping and Sample Splitting For High-Dimensional, Assumption-Free Inference
Authors:
Alessandro Rinaldo,
Larry Wasserman,
Max G'Sell,
Jing Lei
Abstract:
Several new methods have been proposed for performing valid inference after model selection. An older method is sample splitting: use part of the data for model selection and part for inference. In this paper we revisit sample splitting combined with the bootstrap (or the Normal approximation). We show that this leads to a simple, assumption-free approach to inference and we establish results on the accuracy of the method. In fact, we find new bounds on the accuracy of the bootstrap and the Normal approximation for general nonlinear parameters with increasing dimension which we then use to assess the accuracy of regression inference. We show that an alternative, called the image bootstrap, has higher coverage accuracy at the cost of more computation. We define new parameters that measure variable importance and that can be inferred with greater accuracy than the usual regression coefficients. There is an inference-prediction tradeoff: splitting increases the accuracy and robustness of inference but can decrease the accuracy of the predictions.
Submitted 2 April, 2018; v1 submitted 16 November, 2016;
originally announced November 2016.
-
Statistical Inference for Cluster Trees
Authors:
Jisu Kim,
Yen-Chi Chen,
Sivaraman Balakrishnan,
Alessandro Rinaldo,
Larry Wasserman
Abstract:
A cluster tree provides a highly-interpretable summary of a density function by representing the hierarchy of its high-density clusters. It is estimated using the empirical tree, which is the cluster tree constructed from a density estimator. This paper addresses the basic question of quantifying our uncertainty by assessing the statistical significance of topological features of an empirical cluster tree. We first study a variety of metrics that can be used to compare different trees, analyze their properties and assess their suitability for inference. We then propose methods to construct and summarize confidence sets for the unknown true cluster tree. We introduce a partial ordering on cluster trees which we use to prune some of the statistically insignificant features of the empirical tree, yielding interpretable and parsimonious cluster trees. Finally, we illustrate the proposed methods on a variety of synthetic examples and furthermore demonstrate their utility in the analysis of a Graft-versus-Host Disease (GvHD) data set.
Submitted 12 February, 2017; v1 submitted 20 May, 2016;
originally announced May 2016.
-
Minimax Rates for Estimating the Dimension of a Manifold
Authors:
Jisu Kim,
Alessandro Rinaldo,
Larry Wasserman
Abstract:
Many algorithms in machine learning and computational geometry require, as input, the intrinsic dimension of the manifold that supports the probability distribution of the data. This parameter is rarely known and therefore has to be estimated. We characterize the statistical difficulty of this problem by deriving upper and lower bounds on the minimax rate for estimating the dimension. First, we consider the problem of testing the hypothesis that the support of the data-generating probability distribution is a well-behaved manifold of intrinsic dimension $d_1$ versus the alternative that it is of dimension $d_2$, with $d_{1}<d_{2}$. With an i.i.d. sample of size $n$, we provide an upper bound on the probability of choosing the wrong dimension of $O\left( n^{-\left(d_{2}/d_{1}-1-ε\right)n} \right)$, where $ε$ is an arbitrarily small positive number. The proof is based on bounding the length of the traveling salesman path through the data points. We also demonstrate a lower bound of $Ω\left( n^{-(2d_{2}-2d_{1}+ε)n} \right)$, by applying Le Cam's lemma with a specific set of $d_{1}$-dimensional probability distributions. We then extend these results to get minimax rates for estimating the dimension of well-behaved manifolds. We obtain an upper bound of order $O \left( n^{-(\frac{1}{m-1}-ε)n} \right)$ and a lower bound of order $Ω\left( n^{-(2+ε)n} \right)$, where $m$ is the embedding dimension.
Submitted 30 December, 2019; v1 submitted 3 May, 2016;
originally announced May 2016.
-
Distribution-Free Predictive Inference For Regression
Authors:
Jing Lei,
Max G'Sell,
Alessandro Rinaldo,
Ryan J. Tibshirani,
Larry Wasserman
Abstract:
We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows for the construction of a prediction band for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite-sample marginal coverage even when these assumptions do not hold. We analyze and compare, both empirically and theoretically, the two major variants of our conformal framework: full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called {\it rank-one-out} conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with locally varying length, in order to adapt to heteroskedasticity in the data. Finally, we propose a model-free notion of variable importance, called {\it leave-one-covariate-out} or LOCO inference. Accompanying this paper is an R package {\tt conformalInference} that implements all of the proposals we have introduced. In the spirit of reproducibility, all of our empirical results can also be easily (re)generated using this package.
Submitted 8 March, 2017; v1 submitted 14 April, 2016;
originally announced April 2016.
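The split conformal variant mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the interface of the {\tt conformalInference} R package: the function name `split_conformal_interval`, the 50/50 data split, and the use of ordinary least squares as the base regression estimator are all assumptions made for the example. Any other regression estimator could be substituted for the least-squares fit.

```python
import numpy as np

def split_conformal_interval(x_train, y_train, x_new, alpha=0.1, seed=0):
    """Split conformal prediction band (hypothetical helper, illustrative only).

    x_train: (n, p) array, y_train: (n,) array, x_new: (k, p) array.
    Returns lower and upper band values at each row of x_new.
    """
    rng = np.random.default_rng(seed)
    n = len(y_train)
    perm = rng.permutation(n)
    fit_idx, cal_idx = perm[: n // 2], perm[n // 2:]

    # Assumed base estimator: least squares with an intercept,
    # fitted on the first half of the data only.
    design = np.column_stack([np.ones(len(fit_idx)), x_train[fit_idx]])
    beta, *_ = np.linalg.lstsq(design, y_train[fit_idx], rcond=None)

    def predict(x):
        return np.column_stack([np.ones(len(x)), x]) @ beta

    # Absolute residuals on the held-out calibration half.
    resid = np.sort(np.abs(y_train[cal_idx] - predict(x_train[cal_idx])))
    m = len(cal_idx)

    # The k-th smallest residual with k = ceil((1 - alpha)(m + 1));
    # if k exceeds m the band is infinite.
    k = int(np.ceil((1 - alpha) * (m + 1)))
    q = np.inf if k > m else resid[k - 1]

    center = predict(x_new)
    return center - q, center + q
```

The band has constant half-width `q`. The locally varying bands described in the abstract would need an additional estimate of residual spread, which this sketch omits.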
-
Classification accuracy as a proxy for two sample testing
Authors:
Ilmun Kim,
Aaditya Ramdas,
Aarti Singh,
Larry Wasserman
Abstract:
When data analysts train a classifier and check if its accuracy is significantly different from chance, they are implicitly performing a two-sample test. We investigate the statistical properties of this flexible approach in the high-dimensional setting. We prove two results that hold for all classifiers in any dimension: if its true error remains $ε$-better than chance for some $ε>0$ as $d,n \to \infty$, then (a) the permutation-based test is consistent (has power approaching one), (b) a computationally efficient test based on a Gaussian approximation of the null distribution is also consistent. To get a finer understanding of the rates of consistency, we study a specialized setting of distinguishing Gaussians with mean-difference $δ$ and common (known or unknown) covariance $Σ$, when $d/n \to c \in (0,\infty)$. We study variants of Fisher's linear discriminant analysis (LDA) such as "naive Bayes" in a nontrivial regime when $ε\to 0$ (the Bayes classifier has true accuracy approaching 1/2), and contrast their power with corresponding variants of Hotelling's test. Surprisingly, the expressions for their power match exactly in terms of $n,d,δ,Σ$, and the LDA approach is only worse by a constant factor, achieving an asymptotic relative efficiency (ARE) of $1/\sqrt{π}$ for balanced samples. We also extend our results to high-dimensional elliptical distributions with finite kurtosis. Other results of independent interest include minimax lower bounds, and the optimality of Hotelling's test when $d=o(n)$. Simulation results validate our theory, and we present practical takeaway messages along with natural open problems.
Submitted 17 February, 2020; v1 submitted 5 February, 2016;
originally announced February 2016.
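The idea of using held-out classification accuracy as a two-sample test statistic, calibrated by permutation, can be sketched as follows. This is only an illustration of the general recipe described in the abstract above. The nearest-centroid classifier, the 50/50 train/test split, and the names `heldout_accuracy` and `classifier_two_sample_test` are assumptions chosen for brevity, not the specific classifiers analyzed in the paper.

```python
import numpy as np

def heldout_accuracy(X, y, rng):
    """Train a nearest-centroid classifier on a random half, score on the rest."""
    n = len(y)
    perm = rng.permutation(n)
    tr, te = perm[: n // 2], perm[n // 2:]
    mu0 = X[tr][y[tr] == 0].mean(axis=0)
    mu1 = X[tr][y[tr] == 1].mean(axis=0)
    d0 = ((X[te] - mu0) ** 2).sum(axis=1)
    d1 = ((X[te] - mu1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)  # predict the class with the closer centroid
    return (pred == y[te]).mean()

def classifier_two_sample_test(X0, X1, n_perm=200, seed=0):
    """Permutation p-value for H0: both samples come from the same distribution."""
    rng = np.random.default_rng(seed)
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(len(X0), dtype=int),
                        np.ones(len(X1), dtype=int)])
    observed = heldout_accuracy(X, y, rng)
    # Null distribution: recompute the accuracy after shuffling the labels.
    null = np.array([heldout_accuracy(X, rng.permutation(y), rng)
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value
```

A small p-value indicates that the classifier separates the two samples better than it separates randomly relabeled data.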
-
Minimax Lower Bounds for Linear Independence Testing
Authors:
Aaditya Ramdas,
David Isenberg,
Aarti Singh,
Larry Wasserman
Abstract:
Linear independence testing is a fundamental information-theoretic and statistical problem that can be posed as follows: given $n$ points $\{(X_i,Y_i)\}^n_{i=1}$ from a $p+q$ dimensional multivariate distribution where $X_i \in \mathbb{R}^p$ and $Y_i \in\mathbb{R}^q$, determine whether $a^T X$ and $b^T Y$ are uncorrelated for every $a \in \mathbb{R}^p, b\in \mathbb{R}^q$ or not. We give a minimax lower bound for this problem (when $p+q,n \to \infty$, $(p+q)/n \leq κ< \infty$, without sparsity assumptions). In summary, our results imply that $n$ must be at least as large as $\sqrt {pq}/\|Σ_{XY}\|_F^2$ for any procedure (test) to have non-trivial power, where $Σ_{XY}$ is the cross-covariance matrix of $X,Y$. We also provide some evidence that the lower bound is tight, by connections to two-sample testing and regression in specific settings.
Submitted 23 January, 2016;
originally announced January 2016.
-
Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing
Authors:
Aaditya Ramdas,
Sashank J. Reddi,
Barnabas Poczos,
Aarti Singh,
Larry Wasserman
Abstract:
Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), which is about testing for any difference in distributions. A large number of test statistics have been proposed for both these settings. This paper connects three classes of statistics - high dimensional variants of Hotelling's t-test, statistics based on Reproducing Kernel Hilbert Spaces, and energy statistics based on pairwise distances. We ask the question: how much statistical power do popular kernel and distance based tests for GDA have when the unknown distributions differ in their means, compared to specialized tests for MDA?
We formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime. Some practically important properties are the following. (a) eED and gMMD have asymptotically equal power; furthermore, they enjoy a free lunch because, while they are additionally consistent for GDA, they also have the same power as specialized high-dimensional t-test variants for MDA. All these tests are asymptotically optimal (including matching constants) under MDA for spherical covariances, according to simple lower bounds. (b) The power of gMMD is independent of the kernel bandwidth, as long as it is larger than the choice made by the median heuristic. (c) There is a clear and smooth computation-statistics tradeoff for linear-time, subquadratic-time and quadratic-time versions of these tests, with more computation resulting in higher power.
Submitted 4 August, 2015;
originally announced August 2015.
-
Statistical Inference using the Morse-Smale Complex
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The Morse-Smale complex of a function $f$ decomposes the sample space into cells where $f$ is increasing or decreasing. When applied to nonparametric density estimation and regression, it provides a way to represent, visualize, and compare multivariate functions. In this paper, we present some statistical results on estimating Morse-Smale complexes. This allows us to derive new results for two existing methods: mode clustering and Morse-Smale regression. We also develop two new methods based on the Morse-Smale complex: a visualization technique for multivariate functions and a two-sample, multivariate hypothesis test.
Submitted 3 April, 2017; v1 submitted 29 June, 2015;
originally announced June 2015.
-
Uniform Asymptotic Inference and the Bootstrap After Model Selection
Authors:
Ryan J. Tibshirani,
Alessandro Rinaldo,
Robert Tibshirani,
Larry Wasserman
Abstract:
Recently, Tibshirani et al. (2016) proposed a method for making inferences about parameters defined by model selection, in a typical regression setting with normally distributed errors. Here, we study the large sample properties of this method, without assuming normality. We prove that the test statistic of Tibshirani et al. (2016) is asymptotically valid, as the number of samples n grows and the dimension d of the regression problem stays fixed. Our asymptotic result holds uniformly over a wide class of nonnormal error distributions. We also propose an efficient bootstrap version of this test that is provably (asymptotically) conservative, and in practice, often delivers shorter intervals than those from the original normality-based approach. Finally, we prove that the test statistic of Tibshirani et al. (2016) does not enjoy uniform validity in a high-dimensional setting, when the dimension d is allowed to grow.
Submitted 9 August, 2017; v1 submitted 20 June, 2015;
originally announced June 2015.
-
An Analysis of Active Learning With Uniform Feature Noise
Authors:
Aaditya Ramdas,
Barnabas Poczos,
Aarti Singh,
Larry Wasserman
Abstract:
In active learning, the user sequentially chooses values for feature $X$ and an oracle returns the corresponding label $Y$. In this paper, we consider the effect of feature noise in active learning, which could arise either because $X$ itself is being measured, or it is corrupted in transmission to the oracle, or the oracle returns the label of a noisy version of the query point. In statistics, feature noise is known as "errors in variables" and has been studied extensively in non-active settings. However, the effect of feature noise in active learning has not been studied before. We consider the well-known Berkson errors-in-variables model with additive uniform noise of width $σ$.
Our simple but revealing setting is that of one-dimensional binary classification, where the goal is to learn a threshold (the point where the probability of a $+$ label crosses one half). We deal with regression functions that are antisymmetric in a region of size $σ$ around the threshold and also satisfy Tsybakov's margin condition around the threshold. We prove minimax lower and upper bounds which demonstrate that when $σ$ is smaller than the minimax active/passive noiseless error derived in \cite{CN07}, then noise has no effect on the rates and one achieves the same noiseless rates. For larger $σ$, the \textit{unflattening} of the regression function on convolution with uniform noise, along with its local antisymmetry around the threshold, together yield a behaviour where noise \textit{appears} to be beneficial. Our key result is that active learning can buy significant improvement over a passive strategy even in the presence of feature noise.
Submitted 15 May, 2015;
originally announced May 2015.
-
Risk Bounds For Mode Clustering
Authors:
Martin Azizyan,
Yen-Chi Chen,
Aarti Singh,
Larry Wasserman
Abstract:
Density mode clustering is a nonparametric clustering method. The clusters are the basins of attraction of the modes of a density estimator. We study the risk of mode-based clustering. We show that the clustering risk over the cluster cores --- the regions where the density is high --- is very small even in high dimensions. And under a low noise condition, the overall cluster risk is small even beyond the cores, in high dimensions.
Submitted 3 May, 2015;
originally announced May 2015.
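The description of clusters as basins of attraction of the modes of a density estimator can be made concrete with a mean-shift iteration. The sketch below is an assumption-laden illustration, not the procedure analyzed in the paper: it uses a Gaussian kernel, a fixed number of iterations, and a simple distance rule (merge end points closer than one bandwidth) to decide which points reached the same mode. The names `mean_shift_modes` and `mode_clusters` are hypothetical.

```python
import numpy as np

def mean_shift_modes(X, bandwidth, n_iter=300):
    """Move every point uphill on a Gaussian kernel density estimate of X."""
    Z = X.copy()
    for _ in range(n_iter):
        # Squared distances between current positions Z and the data X.
        sq = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq / (2.0 * bandwidth ** 2))
        # Mean-shift update: kernel-weighted average of the data points.
        Z = (W @ X) / W.sum(axis=1, keepdims=True)
    return Z

def mode_clusters(X, bandwidth, n_iter=300):
    """Assign each point to the mode its mean-shift path ends at."""
    Z = mean_shift_modes(X, bandwidth, n_iter)
    modes = []
    labels = np.empty(len(X), dtype=int)
    for i, z in enumerate(Z):
        for j, m in enumerate(modes):
            # Assumed merging rule: end points within one bandwidth share a mode.
            if np.linalg.norm(z - m) < bandwidth:
                labels[i] = j
                break
        else:
            modes.append(z)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)
```

The bandwidth controls how many modes the density estimate has, and therefore how many clusters are returned.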
-
Density Level Sets: Asymptotics, Inference, and Visualization
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
We derive asymptotic theory for the plug-in estimate for density level sets under Hausdorff loss. Based on the asymptotic theory, we propose two bootstrap confidence regions for level sets. The confidence regions can be used to perform tests for anomaly detection and clustering. We also introduce a technique to visualize high dimensional density level sets by combining mode clustering and multidimensional scaling.
Submitted 5 September, 2016; v1 submitted 21 April, 2015;
originally announced April 2015.
-
Robust Topological Inference: Distance To a Measure and Kernel Distance
Authors:
Frédéric Chazal,
Brittany T. Fasy,
Fabrizio Lecci,
Bertrand Michel,
Alessandro Rinaldo,
Larry Wasserman
Abstract:
Let P be a distribution with support S. The salient features of S can be quantified with persistent homology, which summarizes topological features of the sublevel sets of the distance function (the distance of any point x to S). Given a sample from P we can infer the persistent homology using an empirical version of the distance function. However, the empirical distance function is highly non-robust to noise and outliers. Even one outlier is deadly. The distance-to-a-measure (DTM), introduced by Chazal et al. (2011), and the kernel distance, introduced by Phillips et al. (2014), are smooth functions that provide useful topological information but are robust to noise and outliers. Chazal et al. (2014) derived concentration bounds for DTM. Building on these results, we derive limiting distributions and confidence sets, and we propose a method for choosing tuning parameters.
Submitted 22 December, 2014;
originally announced December 2014.
-
Nonparametric modal regression
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Ryan J. Tibshirani,
Larry Wasserman
Abstract:
Modal regression estimates the local modes of the distribution of $Y$ given $X=x$, instead of the mean, as in the usual regression sense, and can hence reveal important structure missed by usual regression methods. We study a simple nonparametric method for modal regression, based on a kernel density estimate (KDE) of the joint distribution of $Y$ and $X$. We derive asymptotic error bounds for this method, and propose techniques for constructing confidence sets and prediction sets. The latter is used to select the smoothing bandwidth of the underlying KDE. The idea behind modal regression is connected to many others, such as mixture regression and density ridge estimation, and we discuss these ties as well.
Submitted 30 March, 2016; v1 submitted 4 December, 2014;
originally announced December 2014.
-
On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives
Authors:
Aaditya Ramdas,
Sashank J. Reddi,
Barnabas Poczos,
Aarti Singh,
Larry Wasserman
Abstract:
Nonparametric two sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (\textit{general} alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (\textit{mean-shift} alternatives).
The main contribution of this paper is to explicitly characterize the power of a popular nonparametric two sample test, designed for general alternatives, under a mean-shift alternative in the high-dimensional setting. Specifically, we explicitly derive the power of the linear-time Maximum Mean Discrepancy statistic using the Gaussian kernel, where the dimension and sample size can both tend to infinity at any rate, and the two distributions differ in their means. As a corollary, we find that if the signal-to-noise ratio is held constant, then the test's power goes to one if the number of samples increases faster than the dimension increases. This is the first explicit power derivation for a general nonparametric test in the high-dimensional setting, and also the first analysis of how tests designed for general alternatives perform when faced with easier ones.
Submitted 23 November, 2014;
originally announced November 2014.
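For readers unfamiliar with the linear-time Maximum Mean Discrepancy statistic discussed in the abstract above, the following sketch shows one common way to compute it with a Gaussian kernel: the samples are split into disjoint consecutive pairs and a kernel-based quantity is averaged over the pairs, so the cost grows linearly in the sample size. The function names, the pairing by even and odd indices, and the studentized z-score are assumptions made for the illustration, and the bandwidth is left to the caller.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Row-wise Gaussian kernel values k(a_i, b_i)."""
    sq = np.sum((a - b) ** 2, axis=1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def linear_time_mmd(X, Y, bandwidth):
    """Linear-time MMD estimate and its studentized z-score.

    X, Y: (n, d) arrays; only the first 2 * floor(min(n_X, n_Y) / 2) rows are used.
    """
    m = (min(len(X), len(Y)) // 2) * 2
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    # One term per disjoint pair of observations:
    # k(x, x') + k(y, y') - k(x, y') - k(x', y).
    h = (gaussian_kernel(x1, x2, bandwidth)
         + gaussian_kernel(y1, y2, bandwidth)
         - gaussian_kernel(x1, y2, bandwidth)
         - gaussian_kernel(x2, y1, bandwidth))
    stat = h.mean()
    # The terms are i.i.d., so a normal approximation applies to their mean.
    z = np.sqrt(len(h)) * stat / h.std(ddof=1)
    return stat, z
```

Large positive values of `z` are evidence against the hypothesis that the two samples share a distribution; a one-sided normal threshold is the usual choice for this statistic.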
-
Asymptotic theory for density ridges
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The large sample theory of estimators for density modes is well understood. In this paper we consider density ridges, which are a higher-dimensional extension of modes. Modes correspond to zero-dimensional, local high-density regions in point clouds. Density ridges correspond to $s$-dimensional, local high-density regions in point clouds. We establish three main results. First we show that under appropriate regularity conditions, the local variation of the estimated ridge can be approximated by an empirical process. Second, we show that the distribution of the estimated ridge converges to a Gaussian process. Third, we establish that the bootstrap leads to valid confidence sets for density ridges.
Submitted 13 October, 2015; v1 submitted 21 June, 2014;
originally announced June 2014.
-
Feature Selection For High-Dimensional Clustering
Authors:
Larry Wasserman,
Martin Azizyan,
Aarti Singh
Abstract:
We present a nonparametric method for selecting informative features in high-dimensional clustering problems. We start with a screening step that uses a test for multimodality. Then we apply kernel density estimation and mode clustering to the selected features. The output of the method consists of a list of relevant features, and cluster assignments. We provide explicit bounds on the error rate of the resulting clustering. In addition, we provide the first error bounds on mode based clustering.
Submitted 9 June, 2014;
originally announced June 2014.