
Finite sample-optimal adjustment sets in linear Gaussian causal models

Authors: Nadja Rutsch, Sara Magliacane, Stéphanie van der Pas

Abstract: Traditional covariate selection methods for causal inference focus on achieving unbiasedness and asymptotic efficiency. In many practical scenarios, researchers must estimate causal effects from observational data with limited sample sizes or in cases where covariates are difficult or costly to measure. Their needs might be better met by selecting adjustment sets that are finite sample-optimal in terms of mean squared error. In this paper, we aim to find the adjustment set that minimizes the mean squared error of the causal effect estimator, taking into account the joint distribution of the variables and the sample size. We call this finite sample-optimal set the MSE-optimal adjustment set and present examples in which the MSE-optimal adjustment set differs from the asymptotically optimal adjustment set. To identify the MSE-optimal adjustment set, we then introduce a sample size criterion for comparing adjustment sets in linear Gaussian models. We also develop graphical criteria to reduce the search space for this adjustment set based on the causal graph. In experiments with simulated data, we show that the MSE-optimal adjustment set can outperform the asymptotically optimal adjustment set in finite sample size settings, making causal inference more practical in such scenarios.

Submitted 15 June, 2025; originally announced June 2025.

arXiv:2305.06816 [pdf, other]

Bayesian sensitivity analysis for a missing data model

Authors: Bart Eggen, Stéphanie L. van der Pas, Aad W. van der Vaart

Abstract: In causal inference, sensitivity analysis is important to assess the robustness of study conclusions to key assumptions. We perform sensitivity analysis of the assumption that missing outcomes are missing completely at random. We follow a Bayesian approach, which is nonparametric for the outcome distribution and can be combined with an informative prior on the sensitivity parameter. We give insight in the posterior and provide theoretical guarantees in the form of Bernstein-von Mises theorems for estimating the mean outcome. We study different parametrisations of the model involving Dirichlet process priors on the distribution of the outcome and on the distribution of the outcome conditional on the subject being treated. We show that these parametrisations incorporate a prior on the sensitivity parameter in different ways and discuss the relative merits. We also present a simulation study, showing the performance of the methods in finite sample scenarios.

Submitted 11 May, 2023; originally announced May 2023.

arXiv:2304.08373 [pdf, ps, other]

Asymptotics of Caliper Matching Estimators for Average Treatment Effects

Authors: Máté Kormos, Stéphanie van der Pas, Aad van der Vaart

Abstract: Caliper matching is used to estimate causal effects of a binary treatment from observational data by comparing matched treated and control units. Units are matched when their propensity scores, the conditional probability of receiving treatment given pretreatment covariates, are within a certain distance called caliper. So far, theoretical results on caliper matching are lacking, leaving practitioners with ad-hoc caliper choices and inference procedures. We bridge this gap by proposing a caliper that balances the quality and the number of matches. We prove that the resulting estimator of the average treatment effect, and average treatment effect on the treated, is asymptotically unbiased and normal at parametric rate. We describe the conditions under which semiparametric efficiency is obtainable, and show that when the parametric propensity score is estimated, the variance is increased for both estimands. Finally, we construct asymptotic confidence intervals for the two estimands.

Submitted 17 April, 2023; originally announced April 2023.

arXiv:2005.02889 [pdf, other]

Multiscale Bayesian Survival Analysis

Authors: Ismaël Castillo, Stéphanie van der Pas

Abstract: We consider Bayesian nonparametric inference in the right-censoring survival model, where modeling is made at the level of the hazard rate. We derive posterior limiting distributions for linear functionals of the hazard, and then for `many' functionals simultaneously in appropriate multiscale spaces. As an application, we derive Bernstein-von Mises theorems for the cumulative hazard and survival functions, which lead to asymptotically efficient confidence bands for these quantities. Further, we show optimal posterior contraction rates for the hazard in terms of the supremum norm. In medical studies, a popular approach is to model hazards a priori as random histograms with possibly dependent heights. This and more general classes of arbitrarily smooth prior distributions are considered as applications of our theory. A sampler is provided for possibly dependent histogram posteriors. Its finite sample properties are investigated on both simulated and real data experiments.

Submitted 31 May, 2021; v1 submitted 6 May, 2020; originally announced May 2020.

MSC Class: 62G15 (Primary); 62G20 (Secondary)

arXiv:1708.08734 [pdf, other]

Posterior Concentration for Bayesian Regression Trees and Forests

Authors: Veronika Rockova, Stephanie van der Pas

Abstract: Since their inception in the 1980's, regression trees have been one of the more widely used non-parametric prediction methods. Tree-structured methods yield a histogram reconstruction of the regression surface, where the bins correspond to terminal nodes of recursive partitioning. Trees are powerful, yet susceptible to over-fitting. Strategies against overfitting have traditionally relied on pruning greedily grown trees. The Bayesian framework offers an alternative remedy against overfitting through priors. Roughly speaking, a good prior charges smaller trees where overfitting does not occur. While the consistency of random histograms, trees and their ensembles has been studied quite extensively, the theoretical understanding of the Bayesian counterparts has been missing. In this paper, we take a step towards understanding why/when do Bayesian trees and their ensembles not overfit. To address this question, we study the speed at which the posterior concentrates around the true smooth regression function. We propose a spike-and-tree variant of the popular Bayesian CART prior and establish new theoretical results showing that regression trees (and their ensembles) (a) are capable of recovering smooth regression surfaces, achieving optimal rates up to a log factor, (b) can adapt to the unknown level of smoothness and (c) can perform effective dimension reduction when p>n. These results provide a piece of missing theoretical evidence explaining why Bayesian trees (and additive variants thereof) have worked so well in practice.

Submitted 13 June, 2019; v1 submitted 29 August, 2017; originally announced August 2017.

arXiv:1708.00078 [pdf, other]

Bayesian Dyadic Trees and Histograms for Regression

Authors: Stephanie van der Pas, Veronika Rockova

Abstract: Many machine learning tools for regression are based on recursive partitioning of the covariate space into smaller regions, where the regression function can be estimated locally. Among these, regression trees and their ensembles have demonstrated impressive empirical performance. In this work, we shed light on the machinery behind Bayesian variants of these methods. In particular, we study Bayesian regression histograms, such as Bayesian dyadic trees, in the simple regression case with just one predictor. We focus on the reconstruction of regression surfaces that are piecewise constant, where the number of jumps is unknown. We show that with suitably designed priors, posterior distributions concentrate around the true step regression function at a near-minimax rate. These results do not require the knowledge of the true number of steps, nor the width of the true partitioning cells. Thus, Bayesian dyadic regression trees are fully adaptive and can recover the true piecewise regression function nearly as well as if we knew the exact number and location of jumps. Our results constitute the first step towards understanding why Bayesian trees and their ensembles have worked so well in practice. As an aside, we discuss prior distributions on balanced interval partitions and how they relate to an old problem in geometric probability. Namely, we relate the probability of covering the circumference of a circle with random arcs whose endpoints are confined to a grid, a new variant of the original problem.

Submitted 28 November, 2017; v1 submitted 31 July, 2017; originally announced August 2017.

arXiv:1702.03698 [pdf, other]

Adaptive posterior contraction rates for the horseshoe

Authors: Stéphanie van der Pas, Botond Szabó, Aad van der Vaart

Abstract: We investigate the frequentist properties of Bayesian procedures for estimation based on the horseshoe prior in the sparse multivariate normal means model. Previous theoretical results assumed that the sparsity level, that is, the number of signals, was known. We drop this assumption and characterize the behavior of the maximum marginal likelihood estimator (MMLE) of a key parameter of the horseshoe prior. We prove that the MMLE is an effective estimator of the sparsity level, in the sense that it leads to (near) minimax optimal estimation of the underlying mean vector generating the data. Besides this empirical Bayes procedure, we consider the hierarchical Bayes method of putting a prior on the unknown sparsity level as well. We show that both Bayesian techniques lead to rate-adaptive optimal posterior contraction, which implies that the horseshoe posterior is a good candidate for generating rate-adaptive credible sets.

Submitted 13 February, 2017; originally announced February 2017.

Comments: arXiv admin note: substantial text overlap with arXiv:1607.01892

arXiv:1608.04242 [pdf, other]

Bayesian Community Detection

Authors: Stéphanie van der Pas, Aad van der Vaart

Abstract: We introduce a Bayesian estimator of the underlying class structure in the stochastic block model, when the number of classes is known. The estimator is the posterior mode corresponding to a Dirichlet prior on the class proportions, a generalized Bernoulli prior on the class labels, and a beta prior on the edge probabilities. We show that this estimator is strongly consistent when the expected degree is at least of order $\log^2{n}$, where $n$ is the number of nodes in the network.

Submitted 15 August, 2016; originally announced August 2016.

arXiv:1607.01892 [pdf, other]

Uncertainty quantification for the horseshoe

Authors: Stéphanie van der Pas, Botond Szabó, Aad van der Vaart

Abstract: We investigate the credible sets and marginal credible intervals resulting from the horseshoe prior in the sparse multivariate normal means model. We do so in an adaptive setting without assuming knowledge of the sparsity level (number of signals). We consider both the hierarchical Bayes method of putting a prior on the unknown sparsity level and the empirical Bayes method with the sparsity level estimated by maximum marginal likelihood. We show that credible balls and marginal credible intervals have good frequentist coverage and optimal size if the sparsity level of the prior is set correctly. By general theory honest confidence sets cannot adapt in size to an unknown sparsity level. Accordingly the hierarchical and empirical Bayes credible sets based on the horseshoe prior are not honest over the full parameter space. We show that this is due to over-shrinkage for certain parameters and characterise the set of parameters for which credible balls and marginal credible intervals do give correct uncertainty quantification. In particular we show that the fraction of false discoveries by the marginal Bayesian procedure is controlled by a correct choice of cut-off.

Submitted 13 February, 2017; v1 submitted 7 July, 2016; originally announced July 2016.

arXiv:1510.02232 [pdf, other]

doi 10.1214/16-EJS1130

Conditions for Posterior Contraction in the Sparse Normal Means Problem

Authors: Stéphanie van der Pas, Jean-Bernard Salomond, Johannes Schmidt-Hieber

Abstract: The first Bayesian results for the sparse normal means problem were proven for spike-and-slab priors. However, these priors are less convenient from a computational point of view. In the meanwhile, a large number of continuous shrinkage priors has been proposed. Many of these shrinkage priors can be written as a scale mixture of normals, which makes them particularly easy to implement. We propose general conditions on the prior on the local variance in scale mixtures of normals, such that posterior contraction at the minimax rate is assured. The conditions require tails at least as heavy as Laplace, but not too heavy, and a large amount of mass around zero relative to the tails, more so as the sparsity increases. These conditions give some general guidelines for choosing a shrinkage prior for estimation under a nearly black sparsity assumption. We verify these conditions for the class of priors considered by Ghosh and Chakrabarti (2015), which includes the horseshoe and the normal-exponential gamma priors, and for the horseshoe+, the inverse-Gaussian prior, the normal-gamma prior, and the spike-and-slab Lasso, and thus extend the number of shrinkage priors which are known to lead to posterior contraction at the minimax estimation rate.

Submitted 13 October, 2015; v1 submitted 8 October, 2015; originally announced October 2015.

Journal ref: Electron. J. Statist. 10 (2016), no. 1, 976--1000. http://projecteuclid.org/euclid.ejs/1460463652

arXiv:1408.5724 [pdf, other]

Almost the Best of Three Worlds: Risk, Consistency and Optional Stopping for the Switch Criterion in Nested Model Selection

Authors: Stéphanie van der Pas, Peter Grünwald

Abstract: We study the switch distribution, introduced by Van Erven et al. (2012), applied to model selection and subsequent estimation. While switching was known to be strongly consistent, here we show that it achieves minimax optimal parametric risk rates up to a $\log\log n$ factor when comparing two nested exponential families, partially confirming a conjecture by Lauritzen (2012) and Cavanaugh (2012) that switching behaves asymptotically like the Hannan-Quinn criterion. Moreover, like Bayes factor model selection but unlike standard significance testing, when one of the models represents a simple hypothesis, the switch criterion defines a robust null hypothesis test, meaning that its Type-I error probability can be bounded irrespective of the stopping rule. Hence, switching is consistent, insensitive to optional stopping and almost minimax risk optimal, showing that, Yang's (2005) impossibility result notwithstanding, it is possible to `almost' combine the strengths of AIC and Bayes factor model selection.

Submitted 15 December, 2016; v1 submitted 25 August, 2014; originally announced August 2014.

Comments: To appear in Statistica Sinica

arXiv:1404.0202 [pdf, ps, other]

doi 10.1214/14-EJS962

The Horseshoe Estimator: Posterior Concentration around Nearly Black Vectors

Authors: S. L. van der Pas, B. J. K. Kleijn, A. W. van der Vaart

Abstract: We consider the horseshoe estimator due to Carvalho, Polson and Scott (2010) for the multivariate normal mean model in the situation that the mean vector is sparse in the nearly black sense. We assume the frequentist framework where the data is generated according to a fixed mean vector. We show that if the number of nonzero parameters of the mean vector is known, the horseshoe estimator attains the minimax $\ell_2$ risk, possibly up to a multiplicative constant. We provide conditions under which the horseshoe estimator combined with an empirical Bayes estimate of the number of nonzero means still yields the minimax risk. We furthermore prove an upper bound on the rate of contraction of the posterior distribution around the horseshoe estimator, and a lower bound on the posterior variance. These bounds indicate that the posterior distribution of the horseshoe prior may be more informative than that of other one-component priors, including the Lasso.

Submitted 15 December, 2014; v1 submitted 1 April, 2014; originally announced April 2014.

Comments: This version differs from the final published version in pagination and typographical detail; Available at http://projecteuclid.org/euclid.ejs/1418134265

MSC Class: 62F15; 62F10

Journal ref: Electron. J. Statist. Volume 8, Number 2 (2014), 2585-2618

Showing 1–12 of 12 results for author: van der Pas, S