Semi-parametric efficient estimation of small genetic effects in large-scale population cohorts

Authors: Olivier Labayle, Breeshey Roskams-Hieter, Joshua Slaughter, Kelsey Tetley-Campbell, Mark J. van der Laan, Chris P. Ponting, Sjoerd Viktor Beentjes, Ava Khamseh

Abstract: Population genetics seeks to quantify DNA variant associations with traits or diseases, as well as interactions among variants and with environmental factors. Computing millions of estimates in large cohorts in which small effect sizes are expected necessitates minimising model-misspecification bias to control false discoveries. We present TarGene, a unified statistical workflow for the semi-parametric efficient and double robust estimation of genetic effects including k-point interactions among categorical variables in the presence of confounding and weak population dependence. k-point interactions, or Average Interaction Effects (AIEs), are a direct generalisation of the usual average treatment effect (ATE). We estimate AIEs with cross-validated and/or weighted versions of Targeted Minimum Loss-based Estimators (TMLE) and One-Step Estimators (OSE). The effect of dependence among data units on variance estimates is corrected by using sieve plateau variance estimators based on genetic relatedness across the units. We present extensive realistic simulations to demonstrate power, coverage, and control of type I error. Our motivating application is the targeted estimation of genetic effects on traits, including two-point and higher-order gene-gene and gene-environment interactions, in large-scale genomic databases such as UK Biobank and All of Us. All cross-validated and/or weighted TMLE and OSE for the AIE k-point interaction, as well as ATEs, conditional ATEs and functions thereof, are implemented in the general purpose Julia package TMLE.jl. For high-throughput applications in population genomics, we provide the open-source Nextflow pipeline and software TarGene which integrates seamlessly with modern high-performance and cloud computing platforms.

Submitted 20 May, 2025; originally announced May 2025.

Comments: 31 pages + appendix, 5 figures
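
For orientation, the two-point case of the AIE described in the abstract above can be illustrated with a simple plug-in calculation. The sketch below is not TarGene or TMLE.jl, and it is not a TMLE or one-step estimator: it assumes two binary variables and a single discrete confounder, and it estimates each conditional mean by a cell average. The function name `plugin_aie` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def plugin_aie(rows):
    """Plug-in estimate of the two-point interaction
    E_W[Q(1,1,W) - Q(1,0,W) - Q(0,1,W) + Q(0,0,W)],
    where Q(t1, t2, w) is the mean of Y in the cell (t1, t2, w).

    rows: iterable of (t1, t2, w, y) with t1, t2 in {0, 1} and w discrete.
    Assumes every (t1, t2) cell is observed in every stratum of w.
    """
    rows = list(rows)
    sums = defaultdict(float)
    counts = defaultdict(int)
    w_counts = defaultdict(int)
    for t1, t2, w, y in rows:
        sums[(t1, t2, w)] += y
        counts[(t1, t2, w)] += 1
        w_counts[w] += 1

    def q(t1, t2, w):
        # Cell mean of the outcome; stands in for a fitted outcome regression.
        return sums[(t1, t2, w)] / counts[(t1, t2, w)]

    n = len(rows)
    total = 0.0
    for w, n_w in w_counts.items():
        contrast = q(1, 1, w) - q(1, 0, w) - q(0, 1, w) + q(0, 0, w)
        total += (n_w / n) * contrast  # average over the empirical law of W
    return total


# Toy data: Y = t1 + 2*t2 + 3*t1*t2 + w, so the interaction is 3 by construction.
toy = [
    (t1, t2, w, t1 + 2 * t2 + 3 * t1 * t2 + w)
    for t1, t2, w in product((0, 1), (0, 1), (0, 1, 2))
]
aie_hat = plugin_aie(toy)
```

Setting the second variable aside and contrasting only `Q(1, W) - Q(0, W)` would give the analogous plug-in for the ATE, which is the sense in which the AIE generalises it.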

arXiv:2504.11740 [pdf, other]

A cautionary note for plasmode simulation studies in the setting of causal inference

Authors: Pamela A Shaw, Susan Gruber, Brian D. Williamson, Rishi Desai, Susan M. Shortreed, Chloe Krakauer, Jennifer C. Nelson, Mark J. van der Laan

Abstract: Plasmode simulation has become an important tool for evaluating the operating characteristics of different statistical methods in complex settings, such as pharmacoepidemiological studies of treatment effectiveness using electronic health records (EHR) data. These studies provide insight into how estimator performance is impacted by challenges including rare events, small sample size, etc., that can indicate which among a set of methods performs best in a real-world dataset. Plasmode simulation combines data resampled from a real-world dataset with synthetic data to generate a known truth for an estimand in realistic data. There are different potential plasmode strategies currently in use. We compare two popular plasmode simulation frameworks. We provide numerical evidence and a theoretical result, which shows that one of these frameworks can cause certain estimators to incorrectly appear overly biased with lower than nominal confidence interval coverage. Detailed simulation studies using both synthetic and real-world EHR data demonstrate that these pitfalls remain at large sample sizes and when analyzing data from a randomized controlled trial. We conclude with guidance for the choice of a plasmode simulation approach that maintains good theoretical properties to allow a fair evaluation of statistical methods while also maintaining the desired similarity to real data.

Submitted 15 April, 2025; originally announced April 2025.

Comments: 55 pages, 6 tables, 2 figures, 8 supplemental tables, 4 supplemental figures

arXiv:2503.22284 [pdf, other]

Powering RCTs for marginal effects with GLMs using prognostic score adjustment

Authors: Emilie Højbjerre-Frandsen, Mark J. van der Laan, Alejandro Schuler

Abstract: In randomized clinical trials (RCTs), the accurate estimation of marginal treatment effects is crucial for determining the efficacy of interventions. Enhancing the statistical power of these analyses is a key objective for statisticians. The increasing availability of historical data from registries, prior trials, and health records presents an opportunity to improve trial efficiency. However, many methods for historical borrowing compromise strict type-I error rate control. Building on the work by Schuler et al. [2022] on prognostic score adjustment for linear models, this paper extends the methodology to the plug-in analysis proposed by Rosenblum et al. [2010] using generalized linear models (GLMs) to further enhance the efficiency of RCT analyses without introducing bias. Specifically, we train a prognostic model on historical control data and incorporate the resulting prognostic scores as covariates in the plug-in GLM analysis of the trial data. This approach leverages the predictive power of historical data to improve the precision of marginal treatment effect estimates. We demonstrate that this method achieves local semi-parametric efficiency under the assumption of an additive treatment effect on the link scale. We expand the GLM plug-in method to include negative binomial regression. Additionally, we provide a straightforward formula for conservatively estimating the asymptotic variance, facilitating power calculations that reflect these efficiency gains. Our simulation study supports the theory. Even without an additive treatment effect, we observe increased power or reduced standard error. While population shifts from historical to trial data may dilute benefits, they do not introduce bias.

Submitted 28 March, 2025; originally announced March 2025.

Comments: 41 pages, 7 figures

arXiv:2412.15012 [pdf, other]

Assessing treatment effects in observational data with missing confounders: A comparative study of practical doubly-robust and traditional missing data methods

Authors: Brian D. Williamson, Chloe Krakauer, Eric Johnson, Susan Gruber, Bryan E. Shepherd, Mark J. van der Laan, Thomas Lumley, Hana Lee, Jose J. Hernandez Munoz, Fengyu Zhao, Sarah K. Dutcher, Rishi Desai, Gregory E. Simon, Susan M. Shortreed, Jennifer C. Nelson, Pamela A. Shaw

Abstract: In pharmacoepidemiology, safety and effectiveness are frequently evaluated using readily available administrative and electronic health records data. In these settings, detailed confounder data are often not available in all data sources and therefore missing on a subset of individuals. Multiple imputation (MI) and inverse-probability weighting (IPW) are go-to analytical methods to handle missing data and are dominant in the biomedical literature. Doubly-robust methods, which are consistent under fewer assumptions, can be more efficient with respect to mean-squared error. We discuss two practical-to-implement doubly-robust estimators, generalized raking and inverse probability-weighted targeted maximum likelihood estimation (TMLE), which are both currently under-utilized in biomedical studies. We compare their performance to IPW and MI in a detailed numerical study for a variety of synthetic data-generating and missingness scenarios, including scenarios with rare outcomes and a high missingness proportion. Further, we consider plasmode simulation studies that emulate the complex data structure of a large electronic health records cohort in order to compare anti-depressant therapies in a rare-outcome setting where a key confounder is prone to more than 50% missingness. We provide guidance on selecting a missing data analysis approach, based on which methods excelled with respect to the bias-variance trade-off across the different scenarios studied.

Submitted 19 December, 2024; originally announced December 2024.

Comments: 142 pages (27 main, 115 supplemental); 6 figures, 2 tables

arXiv:2408.09060 [pdf]

[Invited Discussion] Randomization Tests to Address Disruptions in Clinical Trials: A Report from the NISS Ingram Olkin Forum Series on Unplanned Clinical Trial Disruptions

Authors: Rachael V. Phillips, Mark J. van der Laan

Abstract: Disruptions in clinical trials may be due to external events like pandemics, warfare, and natural disasters. Resulting complications may lead to unforeseen intercurrent events (events that occur after treatment initiation and affect the interpretation of the clinical question of interest or the existence of the measurements associated with it). In Uschner et al. (2023), several example clinical trial disruptions are described: treatment effect drift, population shift, change of care, change of data collection, and change of availability of study medication. A complex randomized controlled trial (RCT) setting with (planned or unplanned) intercurrent events is then described, and randomization tests are presented as a means for non-parametric inference that is robust to violations of assumptions typically made in clinical trials. While estimation methods like Targeted Learning (TL) are valid in such settings, we do not see where the authors make the case that one should be going for a randomization test in such disrupted RCTs. In this discussion, we comment on the appropriateness of TL and the accompanying TL Roadmap in the context of disrupted clinical trials. We highlight a few key articles related to the broad applicability of TL for RCTs and real-world data (RWD) analyses with intercurrent events. We begin by introducing TL and motivating its utility in Section 2, and then in Section 3 we provide a brief overview of the TL Roadmap. In Section 4 we recite the example clinical trial disruptions presented in Uschner et al. (2023), discussing considerations and solutions based on the principles of TL. We request in an authors' rejoinder a clear theoretical demonstration with specific examples in this setting that a randomization test is the only valid inferential method relative to one based on following the TL Roadmap.

Submitted 16 August, 2024; originally announced August 2024.

Comments: This article is an un-refereed, Authors Original Version

arXiv:2404.09847 [pdf, other]

Statistical learning for constrained functional parameters in infinite-dimensional models with applications in fair machine learning

Authors: Razieh Nabi, Nima S. Hejazi, Mark J. van der Laan, David Benkeser

Abstract: Constrained learning has become increasingly important, especially in the realm of algorithmic fairness and machine learning. In these settings, predictive models are developed specifically to satisfy pre-defined notions of fairness. Here, we study the general problem of constrained statistical machine learning through a statistical functional lens. We consider learning a function-valued parameter of interest under the constraint that one or several pre-specified real-valued functional parameters equal zero or are otherwise bounded. We characterize the constrained functional parameter as the minimizer of a penalized risk criterion using a Lagrange multiplier formulation. We show that closed-form solutions for the optimal constrained parameter are often available, providing insight into mechanisms that drive fairness in predictive models. Our results also suggest natural estimators of the constrained parameter that can be constructed by combining estimates of unconstrained parameters of the data generating distribution. Thus, our estimation procedure for constructing fair machine learning algorithms can be applied in conjunction with any statistical learning approach and off-the-shelf software. We demonstrate the generality of our method by explicitly considering a number of examples of statistical fairness constraints and implementing the approach using several popular learning approaches.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.01736 [pdf, other]

Nonparametric efficient causal estimation of the intervention-specific expected number of recurrent events with continuous-time targeted maximum likelihood and highly adaptive lasso estimation

Authors: Helene C. W. Rytgaard, Mark J. van der Laan

Abstract: Longitudinal settings involving outcome, competing risks and censoring events occurring and recurring in continuous time are common in medical research, but are often analyzed with methods that do not allow for taking post-baseline information into account. In this work, we define statistical and causal target parameters via the g-computation formula by carrying out interventions directly on the product integral representing the observed data distribution in a continuous-time counting process model framework. In recurrent events settings our target parameter identifies the expected number of recurrent events also in settings where the censoring mechanism or post-baseline treatment decisions depend on past information of post-baseline covariates such as the recurrent event process. We propose a flexible estimation procedure based on targeted maximum likelihood estimation coupled with highly adaptive lasso estimation to provide a novel approach for double robust and nonparametric inference for the considered target parameter. We illustrate the methods in a simulation study.

Submitted 11 April, 2025; v1 submitted 2 April, 2024; originally announced April 2024.

arXiv:2310.19197 [pdf, other]

concrete: Targeted Estimation of Survival and Competing Risks in Continuous Time

Authors: David Chen, Helene C. W. Rytgaard, Edwin C. H. Fong, Jens M. Tarp, Maya L. Petersen, Mark J. van der Laan, Thomas A. Gerds

Abstract: This article introduces the R package concrete, which implements a recently developed targeted maximum likelihood estimator (TMLE) for the cause-specific absolute risks of time-to-event outcomes measured in continuous time. Cross-validated Super Learner machine learning ensembles are used to estimate propensity scores and conditional cause-specific hazards, which are then targeted to produce robust and efficient plug-in estimates of the effects of static or dynamic interventions on a binary treatment given at baseline quantified as risk differences or risk ratios. Influence curve-based asymptotic inference is provided for TMLE estimates and simultaneous confidence bands can be computed for target estimands spanning multiple times or events. In this paper we review the one-step continuous-time TMLE methodology as it is situated in an overarching causal inference workflow, describe its implementation, and demonstrate the use of the package on the PBC dataset.

Submitted 20 March, 2025; v1 submitted 29 October, 2023; originally announced October 2023.

Comments: 18 pages, 4 figures, submitted to the R Journal

arXiv:2309.16099 [pdf, other]

Nonparametric estimation of a covariate-adjusted counterfactual treatment regimen response curve

Authors: Ashkan Ertefaie, Luke Duttweiler, Brent A. Johnson, Mark J. van der Laan

Abstract: Flexible estimation of the mean outcome under a treatment regimen (i.e., value function) is the key step toward personalized medicine. We define our target parameter as a conditional value function given a set of baseline covariates, which we refer to as a stratum-based value function. We focus on a semiparametric class of decision rules and propose a sieve-based nonparametric covariate-adjusted regimen-response curve estimator within that class. Our work contributes in several ways. First, we propose an inverse probability weighted nonparametrically efficient estimator of the smoothed regimen-response curve function. We show that asymptotic linearity is achieved when the nuisance functions are undersmoothed sufficiently. Asymptotic and finite sample criteria for undersmoothing are proposed. Second, using Gaussian process theory, we propose simultaneous confidence intervals for the smoothed regimen-response curve function. Third, we provide consistency and convergence rate for the optimizer of the regimen-response curve estimator; this enables us to estimate an optimal semiparametric rule. The latter is important as the optimizer corresponds with the optimal dynamic treatment regimen. Some finite-sample properties are explored with simulations.

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2306.07736 [pdf, other]

An Approach to Nonparametric Inference on the Causal Dose Response Function

Authors: Aaron Hudson, Elvin H. Geng, Thomas A. Odeny, Elizabeth A. Bukusi, Maya L. Petersen, Mark J. van der Laan

Abstract: The causal dose response curve is commonly selected as the statistical parameter of interest in studies where the goal is to understand the effect of a continuous exposure on an outcome. Most of the available methodology for statistical inference on the dose-response function in the continuous exposure setting requires strong parametric assumptions on the probability distribution. Such parametric assumptions are typically untenable in practice and lead to invalid inference. It is often preferable to instead use nonparametric methods for inference, which only make mild assumptions about the data-generating mechanism. We propose a nonparametric test of the null hypothesis that the dose-response function is equal to a constant function. We argue that when the null hypothesis holds, the dose-response function has zero variance. Thus, one can test the null hypothesis by assessing whether there is sufficient evidence to claim that the variance is positive. We construct a novel estimator for the variance of the dose-response function, for which we can fully characterize the null limiting distribution and thus perform well-calibrated tests of the null hypothesis. We also present an approach for constructing simultaneous confidence bands for the dose-response function by inverting our proposed hypothesis test. We assess the validity of our proposal in a simulation study. In a data example, we study, in a population of patients who have initiated treatment for HIV, how the distance required to travel to an HIV clinic affects retention in care.

Submitted 13 June, 2023; originally announced June 2023.

Comments: 39 pages, 5 figures

arXiv:2305.01849 [pdf, other]

Semiparametric Discovery and Estimation of Interaction in Mixed Exposures using Stochastic Interventions

Authors: David B. McCoy, Alan E. Hubbard, Alejandro Schuler, Mark J. van der Laan

Abstract: This study introduces a nonparametric definition of interaction and provides an approach to both interaction discovery and efficient estimation of this parameter. Using stochastic shift interventions and ensemble machine learning, our approach identifies and quantifies interaction effects through a model-independent target parameter, estimated via targeted maximum likelihood and cross-validation. This method contrasts the expected outcomes of joint interventions with those of individual interventions. Validation through simulation and application to the National Institute of Environmental Health Sciences Mixtures Workshop data demonstrate the efficacy of our method in detecting true interaction directions and its consistency in identifying significant impacts of furan exposure on leukocyte telomere length. Our method, called InterXshift, advances the ability to analyze multi-exposure interactions within high-dimensional data, offering significant methodological improvements to understand complex exposure dynamics in health research. We provide peer-reviewed open-source software that employs our proposed methodology in the InterXshift R package.

Submitted 28 June, 2024; v1 submitted 2 May, 2023; originally announced May 2023.

arXiv:2301.12029 [pdf, other]

Multi-task Highly Adaptive Lasso

Authors: Ivana Malenica, Rachael V. Phillips, Daniel Lazzareschi, Jeremy R. Coyle, Romain Pirracchio, Mark J. van der Laan

Abstract: We propose a novel, fully nonparametric approach for multi-task learning, the Multi-task Highly Adaptive Lasso (MT-HAL). MT-HAL simultaneously learns features, samples and task associations important for the common model, while imposing a shared sparse structure among similar tasks. Given multiple tasks, our approach automatically finds a sparse sharing structure. The proposed MTL algorithm attains a powerful dimension-free convergence rate of $o_p(n^{-1/4})$ or better. We show that MT-HAL outperforms sparsity-based MTL competitors across a wide range of simulation studies, including settings with nonlinear and linear relationships, varying levels of sparsity and task correlations, and different numbers of covariates and sample sizes.

Submitted 27 January, 2023; originally announced January 2023.

arXiv:2212.02422 [pdf, other]

Adaptive Sequential Surveillance with Network and Temporal Dependence

Authors: Ivana Malenica, Jeremy R. Coyle, Mark J. van der Laan, Maya L. Petersen

Abstract: Strategic test allocation plays a major role in the control of both emerging and existing pandemics (e.g., COVID-19, HIV). Widespread testing supports effective epidemic control by (1) reducing transmission via identifying cases, and (2) tracking outbreak dynamics to inform targeted interventions. However, infectious disease surveillance presents unique statistical challenges. For instance, the true outcome of interest, one's positive infectious status, is often a latent variable. In addition, the presence of both network and temporal dependence reduces the data to a single observation. As testing entire populations regularly is neither efficient nor feasible, standard approaches to testing recommend simple rule-based testing strategies (e.g., symptom based, contact tracing), without taking into account individual risk. In this work, we study an adaptive sequential design involving n individuals over a period of τ time-steps, which allows for unspecified dependence among individuals and across time. Our causal target parameter is the mean latent outcome we would have obtained after one time-step, if, starting at time t given the observed past, we had carried out a stochastic intervention that maximizes the outcome under a resource constraint. We propose an Online Super Learner for adaptive sequential surveillance that learns the optimal choice of testing strategies over time while adapting to the current state of the outbreak. Relying on a series of working models, the proposed method learns across samples, through time, or both, based on the underlying (unknown) structure in the data. We present an identification result for the latent outcome in terms of the observed data, and demonstrate the superior performance of the proposed strategy in a simulation modeling a residential university environment during the COVID-19 pandemic.

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.14671 [pdf, other]

doi 10.1111/biom.13800

Efficient Targeted Learning of Heterogeneous Treatment Effects for Multiple Subgroups

Authors: Waverly Wei, Maya Petersen, Mark J van der Laan, Zeyu Zheng, Chong Wu, Jingshen Wang

Abstract: In biomedical science, analyzing treatment effect heterogeneity plays an essential role in assisting personalized medicine. The main goals of analyzing treatment effect heterogeneity include estimating treatment effects in clinically relevant subgroups and predicting whether a patient subpopulation might benefit from a particular treatment. Conventional approaches often evaluate the subgroup treatment effects via parametric modeling and can thus be susceptible to model mis-specifications. In this manuscript, we take a model-free semiparametric perspective and aim to efficiently evaluate the heterogeneous treatment effects of multiple subgroups simultaneously under the one-step targeted maximum-likelihood estimation (TMLE) framework. When the number of subgroups is large, we further expand this path of research by looking at a variation of the one-step TMLE that is robust to the presence of small estimated propensity scores in finite samples. From our simulations, our method demonstrates substantial finite sample improvements compared to conventional methods. In a case study, our method unveils the potential treatment effect heterogeneity of rs12916-T allele (a proxy for statin usage) in decreasing Alzheimer's disease risk.

Submitted 26 November, 2022; originally announced November 2022.

Comments: Accepted by Biometrics 2022

arXiv:2208.08065 [pdf, ps, other]

doi 10.1353/obs.2023.0001

Revisiting the propensity score's central role: Towards bridging balance and efficiency in the era of causal machine learning

Authors: Nima S. Hejazi, Mark J. van der Laan

Abstract: About forty years ago, in a now-seminal contribution, Rosenbaum & Rubin (1983) introduced a critical characterization of the propensity score as a central quantity for drawing causal inferences in observational study settings. In the decades since, much progress has been made across several research fronts in causal inference, notably including the re-weighting and matching paradigms. Focusing on the former and specifically on its intersection with machine learning and semiparametric efficiency theory, we re-examine the role of the propensity score in modern methodological developments. As Rosenbaum & Rubin (1983)'s contribution spurred a focus on the balancing property of the propensity score, we re-examine the degree to which and how this property plays a role in the development of asymptotically efficient estimators of causal effects; moreover, we discuss a connection between the balancing property and efficient estimation in the form of score equations and propose a score test for evaluating whether an estimator achieves balance.

Submitted 30 September, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

Comments: Accepted for publication in a forthcoming special issue of Observational Studies

Journal ref: Observational Studies, 2023

arXiv:2205.08643 [pdf]

Targeted learning: Towards a future informed by real-world evidence

Authors: Susan Gruber, Rachael V. Phillips, Hana Lee, Martin Ho, John Concato, Mark J. van der Laan

Abstract: The 21st Century Cures Act of 2016 includes a provision for the U.S. Food and Drug Administration (FDA) to evaluate the potential use of real-world evidence (RWE) to support new indications for use for previously approved drugs, and to satisfy post-approval study requirements. Extracting reliable evidence from real-world data (RWD) is often complicated by a lack of treatment randomization, potential intercurrent events, and informative loss to follow up. Targeted Learning (TL) is a sub-field of statistics that provides a rigorous framework to help address these challenges. The TL Roadmap offers a step-by-step guide to generating valid evidence and assessing its reliability. Following these steps produces an extensive amount of information for assessing whether the study provides reliable scientific evidence in support of regulatory decision making. This paper presents two case studies that illustrate the utility of following the roadmap. We use targeted minimum loss-based estimation combined with super learning to estimate causal effects. We also compared these findings with those obtained from an unadjusted analysis, propensity score matching, and inverse probability weighting. Non-parametric sensitivity analyses illuminate how departures from (untestable) causal assumptions would affect point estimates and confidence interval bounds that would impact the substantive conclusion drawn from the study. TL's thorough approach to learning from data provides transparency, allowing trust in RWE to be earned whenever it is warranted.

Submitted 13 June, 2022; v1 submitted 17 May, 2022; originally announced May 2022.

Comments: 34 pages (25 pages main paper + references, 9 page Appendix), 6 figures; version 2 corrected minor typos, numbering errors, etc.

arXiv:2205.05777 [pdf, other]

Efficient estimation of modified treatment policy effects based on the generalized propensity score

Authors: Nima S. Hejazi, David Benkeser, Iván Díaz, Mark J. van der Laan

Abstract: Continuous treatments have posed a significant challenge for causal inference, both in the formulation and identification of scientifically meaningful effects and in their robust estimation. Traditionally, focus has been placed on techniques applicable to binary or categorical treatments with few levels, allowing for the application of propensity score-based methodology with relative ease. Efforts to accommodate continuous treatments introduced the generalized propensity score, yet estimators of this nuisance parameter commonly utilize parametric regression strategies that sharply limit the robustness and efficiency of inverse probability weighted estimators of causal effect parameters. We formulate and investigate a novel, flexible estimator of the generalized propensity score based on a nonparametric function estimator that provably converges at a suitably fast rate to the target functional so as to facilitate statistical inference. With this estimator, we demonstrate the construction of nonparametric inverse probability weighted estimators of a class of causal effect estimands tailored to continuous treatments. To ensure the asymptotic efficiency of our proposed estimators, we outline several non-restrictive selection procedures for utilizing a sieve estimation framework to undersmooth estimators of the generalized propensity score. We provide the first characterization of such inverse probability weighted estimators achieving the nonparametric efficiency bound in a setting with continuous treatments, demonstrating this in numerical experiments. We further evaluate the higher-order efficiency of our proposed estimators by deriving and numerically examining the second-order remainder of the corresponding efficient influence function in the nonparametric model. Open source software implementing our proposed estimation techniques, the haldensify R package, is briefly discussed.

Submitted 28 June, 2022; v1 submitted 11 May, 2022; originally announced May 2022.

arXiv:2205.01285 [pdf, other]

doi 10.1093/biostatistics/kxac029

A Flexible Approach for Predictive Biomarker Discovery

Authors: Philippe Boileau, Nina Ting Qi, Mark J. van der Laan, Sandrine Dudoit, Ning Leng

Abstract: An endeavor central to precision medicine is the discovery of predictive biomarkers, which define patient subpopulations that stand to benefit most, or least, from a given treatment. The identification of these biomarkers is often the byproduct of the related but fundamentally different task of treatment rule estimation. Using treatment rule estimation methods to identify predictive biomarkers in clinical trials where the number of covariates exceeds the number of participants often results in high false discovery rates. The higher than expected number of false positives translates to wasted resources when conducting follow-up experiments for drug target identification and diagnostic assay development. Patient outcomes are in turn negatively affected. We propose a variable importance parameter for directly assessing the importance of potentially predictive biomarkers, and develop a flexible nonparametric inference procedure for this estimand. We prove that our estimator is double-robust and asymptotically linear under loose conditions on the data-generating process, permitting valid inference about the importance metric. The statistical guarantees of the method are verified in a thorough simulation study representative of randomized control trials with moderate and high-dimensional covariate vectors. Our procedure is then used to discover predictive biomarkers from among the tumor gene expression data of metastatic renal cell carcinoma patients enrolled in recently completed clinical trials. We find that our approach more readily discerns predictive from non-predictive biomarkers than procedures whose primary purpose is treatment rule estimation. An open-source software implementation of the methodology, the uniCATE R package, is briefly introduced.

Submitted 1 June, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

arXiv:2204.06139 [pdf]

doi 10.1093/ije/dyad023

Practical considerations for specifying a super learner

Authors: Rachael V. Phillips, Mark J. van der Laan, Hana Lee, Susan Gruber

Abstract: Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modeling. Constructing a predictive model can be thought of as learning a prediction function, i.e., a function that takes as input covariate data and outputs a predicted value. Many strategies for learning these functions from data are available, from parametric regressions to machine learning algorithms. It can be challenging to choose an approach, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task at hand. The super learner (SL) is an algorithm that alleviates concerns over selecting the one "right" strategy while providing the freedom to consider many of them, such as those recommended by collaborators, used in related research, or specified by subject-matter experts. It is an entirely pre-specified and data-adaptive strategy for predictive modeling. To ensure the SL is well-specified for learning the prediction function, the analyst does need to make a few important choices. In this Education Corner article, we provide step-by-step guidelines for making these choices, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience, and guided by theory.
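Illustrative note (not part of the listed abstract): the idea of letting the data choose among pre-specified candidate strategies can be sketched with a "discrete" super learner, which simply selects the candidate with the lowest cross-validated risk. The sketch below is a simplification under stated assumptions: the full SL typically forms a weighted combination of candidates, and the two candidate learners, the squared-error loss, the fold assignment, and all function names here are hypothetical choices made for illustration only.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Hypothetical candidate 1: ignore x and always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Hypothetical candidate 2: least-squares line y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return lambda x: a + b * x

def cv_risk(fit, xs, ys, k=5):
    """Cross-validated mean squared error, averaged over k validation folds."""
    n = len(xs)
    fold_losses = []
    for fold in range(k):
        val = [i for i in range(n) if i % k == fold]
        trn = [i for i in range(n) if i % k != fold]
        predict = fit([xs[i] for i in trn], [ys[i] for i in trn])
        fold_losses.append(mean((ys[i] - predict(xs[i])) ** 2 for i in val))
    return mean(fold_losses)

def discrete_super_learner(candidates, xs, ys, k=5):
    """Pick the candidate with the smallest CV risk and refit it on all data."""
    risks = {name: cv_risk(fit, xs, ys, k) for name, fit in candidates.items()}
    best = min(risks, key=risks.get)
    return best, risks, candidates[best](xs, ys)

# Toy data with an exactly linear relationship (assumed for illustration).
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x for x in xs]
best, risks, predictor = discrete_super_learner(
    {"mean": fit_mean, "line": fit_line}, xs, ys
)
```

On this toy data the linear candidate has the smaller cross-validated risk, so it is the one selected and refit; with real data the library of candidates, the loss, and the fold scheme are the choices the abstract says the analyst must specify.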

Submitted 14 March, 2023; v1 submitted 12 April, 2022; originally announced April 2022.

Comments: A revised version of this article, which incorporates several modifications based on referees' suggestions, has been published in the International Journal of Epidemiology by Oxford University Press

Journal ref: International Journal of Epidemiology, Volume 52, Issue 4, August 2023, Pages 1276-1285

arXiv:2110.12112 [pdf, ps, other]

Why Machine Learning Cannot Ignore Maximum Likelihood Estimation

Authors: Mark J. van der Laan, Sherri Rose

Abstract: The growth of machine learning as a field has been accelerating with increasing interest and publications across fields, including statistics, but predominantly in computer science. How can we parse this vast literature for developments that exemplify the necessary rigor? How many of these manuscripts incorporate foundational theory to allow for statistical inference? Which advances have the greatest potential for impact in practice? One could posit many answers to these queries. Here, we assert that one essential idea is for machine learning to integrate maximum likelihood for estimation of functional parameters, such as prediction functions and conditional densities.

Submitted 22 October, 2021; originally announced October 2021.

Comments: 30 pages. Forthcoming as a chapter in the Handbook of Matching and Weighting in Causal Inference

arXiv:2110.09633 [pdf, other]

Defining and Estimating Effects in Cluster Randomized Trials: A Methods Comparison

Authors: Alejandra Benitez, Maya L. Petersen, Mark J. van der Laan, Nicole Santos, Elizabeth Butrick, Dilys Walker, Rakesh Ghosh, Phelgona Otieno, Peter Waiswa, Laura B. Balzer

Abstract: Across research disciplines, cluster randomized trials (CRTs) are commonly implemented to evaluate interventions delivered to groups of participants, such as communities and clinics. Despite advances in the design and analysis of CRTs, several challenges remain. First, there are many possible ways to specify the causal effect of interest (e.g., at the individual-level or at the cluster-level). Second, the theoretical and practical performance of common methods for CRT analysis remains poorly understood. Here, we present a general framework to formally define an array of causal effects in terms of summary measures of counterfactual outcomes. Next, we provide a comprehensive overview of CRT estimators, including the t-test, generalized estimating equations (GEE), augmented-GEE, and targeted maximum likelihood estimation (TMLE). Using finite sample simulations, we illustrate the practical performance of these estimators for different causal effects and when, as commonly occurs, there are limited numbers of clusters of different sizes. Finally, our application to data from the Preterm Birth Initiative (PTBi) study demonstrates the real-world impact of varying cluster sizes and targeting effects at the cluster-level or at the individual-level. Specifically, the relative effect of the PTBi intervention was 0.81 at the cluster-level, corresponding to a 19% reduction in outcome incidence, and was 0.66 at the individual-level, corresponding to a 34% reduction in outcome risk. Given its flexibility to estimate a variety of user-specified effects and ability to adaptively adjust for covariates for precision gains while maintaining Type-I error control, we conclude TMLE is a promising tool for CRT analysis.
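Illustrative note (not part of the listed abstract): the percentage reductions quoted for the two relative effects follow from the usual conversion, reduction = (1 - relative effect) x 100. A minimal sketch of that arithmetic, where the function name and the rounding to one decimal place are assumptions made for illustration:

```python
def percent_reduction(relative_effect):
    """Convert a relative effect (ratio scale) to a percentage reduction."""
    return round((1.0 - relative_effect) * 100.0, 1)

cluster_level = percent_reduction(0.81)      # relative effect at the cluster level
individual_level = percent_reduction(0.66)   # relative effect at the individual level
print(cluster_level, individual_level)       # 19.0 34.0
```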

Submitted 3 May, 2023; v1 submitted 18 October, 2021; originally announced October 2021.

arXiv:2109.14048 [pdf, other]

Evaluating the Robustness of Targeted Maximum Likelihood Estimators via Realistic Simulations in Nutrition Intervention Trials

Authors: Haodong Li, Sonali Rosete, Jeremy Coyle, Rachael V. Phillips, Nima S. Hejazi, Ivana Malenica, Benjamin F. Arnold, Jade Benjamin-Chung, Andrew Mertens, John M. Colford Jr, Mark J. van der Laan, Alan E. Hubbard

Abstract: Several recently developed methods have the potential to harness machine learning in the pursuit of target quantities inspired by causal inference, including inverse weighting, doubly robust estimating equations and substitution estimators like targeted maximum likelihood estimation. There are even more recent augmentations of these procedures that can increase robustness, by adding a layer of cross-validation (cross-validated targeted maximum likelihood estimation and double machine learning, as applied to substitution and estimating equation approaches, respectively). While these methods have been evaluated individually on simulated and experimental data sets, a comprehensive analysis of their performance across "real-world" simulations has yet to be conducted. In this work, we benchmark multiple widely used methods for estimation of the average treatment effect using data from ten different nutrition intervention studies. A realistic set of simulations, based on a novel method, highly adaptive lasso, for estimating the data-generating distribution that guarantees a certain level of complexity (undersmoothing) is used to better mimic the complexity of the true data-generating distribution. We have applied this novel method for estimating the data-generating distribution by individual study and subsequently used these fits to simulate data and estimate treatment effect parameters as well as their standard errors and resulting confidence intervals. Based on the analytic results, a general recommendation is put forth for use of the cross-validated variants of both substitution and estimating equation estimators. We conclude that the additional layer of cross-validation helps in avoiding unintentional over-fitting of nuisance parameter functionals and leads to more robust inferences.

Submitted 28 September, 2021; originally announced September 2021.

arXiv:2109.10452 [pdf, other]

Personalized Online Machine Learning

Authors: Ivana Malenica, Rachael V. Phillips, Romain Pirracchio, Antoine Chambaz, Alan Hubbard, Mark J. van der Laan

Abstract: In this work, we introduce the Personalized Online Super Learner (POSL) -- an online ensembling algorithm for streaming data whose optimization procedure accommodates varying degrees of personalization. Namely, POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized (i.e., optimization with respect to baseline covariate subject ID) to many individuals (i.e., optimization with respect to common baseline covariates). As an online algorithm, POSL learns in real-time. POSL can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed algorithms that are never updated during the procedure, pooled algorithms that learn from many individuals' time-series, and individualized algorithms that learn from within a single time-series. POSL's ensembling of this hybrid of base learning strategies depends on the amount of data collected, the stationarity of the time-series, and the mutual characteristics of a group of time-series. In essence, POSL decides whether to learn across samples, through time, or both, based on the underlying (unknown) structure in the data. For a wide range of simulations that reflect realistic forecasting scenarios, and in a medical data application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for time-series data and adjust to changing data-generating environments. We further cultivate POSL's practicality by extending it to settings where time-series enter/exit dynamically over chronological time.

Submitted 21 September, 2021; originally announced September 2021.

arXiv:2107.01537 [pdf, other]

One-step TMLE for targeting cause-specific absolute risks and survival curves

Authors: Helene C. W. Rytgaard, Mark J. van der Laan

Abstract: This paper considers a one-step targeted maximum likelihood estimation method for general competing risks and survival analysis settings where event times take place on the positive real line R+ and are subject to right-censoring. Our interest is overall in the effects of baseline treatment decisions, static, dynamic or stochastic, possibly confounded by pre-treatment covariates. We point out two overall contributions of our work. First, our method can be used to obtain simultaneous inference across all absolute risks in competing risks settings. Second, we present a practical result for achieving inference for the full survival curve, or a full absolute risk curve, across time by targeting over a fine enough grid of points. The one-step procedure is based on a one-dimensional universal least favorable submodel for each cause-specific hazard that can be implemented in recursive steps along a corresponding universal least favorable submodel. We present a theorem for conditions to achieve weak convergence of the estimator for an infinite-dimensional target parameter. Our empirical study demonstrates the use of the methods.

Submitted 1 September, 2021; v1 submitted 4 July, 2021; originally announced July 2021.

Comments: 21 pages (including appendix), 1 figure, 5 tables

arXiv:2105.02088 [pdf, other]

Continuous-time targeted minimum loss-based estimation of intervention-specific mean outcomes

Authors: Helene C. Rytgaard, Thomas A. Gerds, Mark J. van der Laan

Abstract: This paper studies the generalization of the targeted minimum loss-based estimation (TMLE) framework to estimation of effects of time-varying interventions in settings where both interventions, covariates, and outcome can happen at subject-specific time-points on an arbitrarily fine time-scale. TMLE is a general template for constructing asymptotically linear substitution estimators for smooth low-dimensional parameters in infinite-dimensional models. Existing longitudinal TMLE methods are developed for data where observations are made on a discrete time-grid. We consider a continuous-time counting process model where intensity measures track the monitoring of subjects, and focus on a low-dimensional target parameter defined as the intervention-specific mean outcome at the end of follow-up. To construct our TMLE algorithm for the given statistical estimation problem we derive an expression for the efficient influence curve and represent the target parameter as a functional of intensities and conditional expectations. The high-dimensional nuisance parameters of our model are estimated and updated in an iterative manner according to separate targeting steps for the involved intensities and conditional expectations. The resulting estimator solves the efficient influence curve equation. We state a general efficiency theorem and describe a highly adaptive lasso estimator for nuisance parameters that allows us to establish asymptotic linearity and efficiency of our estimator under minimal conditions on the underlying statistical model.

Submitted 5 May, 2021; originally announced May 2021.

Comments: 27 pages (excluding supplementary material), 1 figure

arXiv:2102.09715 [pdf, other]

doi 10.1080/10618600.2022.2110883

Cross-Validated Loss-Based Covariance Matrix Estimator Selection in High Dimensions

Authors: Philippe Boileau, Nima S. Hejazi, Mark J. van der Laan, Sandrine Dudoit

Abstract: The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience, however. As such, a variety of estimators have been derived to overcome the shortcomings of the sample covariance matrix in these settings. Yet, the question of selecting an optimal estimator from among the plethora available remains largely unaddressed. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. In particular, we propose a general class of loss functions for covariance matrix estimation and establish finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validated estimator selector with respect to these loss functions. We evaluate our proposed approach via a comprehensive set of simulation experiments and demonstrate its practical benefits by application in the exploratory analysis of two single-cell transcriptome sequencing datasets. A free and open-source software implementation of the proposed methodology, the cvCovEst R package, is briefly introduced.

Submitted 6 May, 2022; v1 submitted 18 February, 2021; originally announced February 2021.

Comments: 32 pages, 8 figures; updated contents of section 3, fixed typos

arXiv:2102.00102 [pdf, other]

Adaptive Sequential Design for a Single Time-Series

Authors: Ivana Malenica, Aurelien Bibaut, Mark J. van der Laan

Abstract: The current work is motivated by the need for robust statistical methods for precision medicine; as such, we address the need for statistical methods that provide actionable inference for a single unit at any point in time. We aim to learn an optimal, unknown choice of the controlled components of the design in order to optimize the expected outcome; with that, we adapt the randomization mechanism for future time-point experiments based on the data collected on the individual over time. Our results demonstrate that one can learn the optimal rule based on a single sample, and thereby adjust the design at any point t with valid inference for the mean target parameter. This work provides several contributions to the field of statistical precision medicine. First, we define a general class of averages of conditional causal parameters defined by the current context for the single unit time-series data. We define a nonparametric model for the probability distribution of the time-series under few assumptions, and aim to fully utilize the sequential randomization in the estimation procedure via the double robust structure of the efficient influence curve of the proposed target parameter. We present multiple exploration-exploitation strategies for assigning treatment, and methods for estimating the optimal rule. Lastly, we present the study of the data-adaptive inference on the mean under the optimal treatment rule, where the target parameter adapts over time in response to the observed context of the individual. Our target parameter is pathwise differentiable with an efficient influence function that is doubly robust - which makes it easier to estimate than previously proposed variations. We characterize the limit distribution of our estimator under a Donsker condition expressed in terms of a notion of bracketing entropy adapted to martingale settings.

Submitted 1 July, 2021; v1 submitted 29 January, 2021; originally announced February 2021.

Comments: arXiv admin note: text overlap with arXiv:1809.00734

arXiv:2009.06203 [pdf, other]

doi 10.1093/biostatistics/kxac002

Nonparametric causal mediation analysis for stochastic interventional (in)direct effects

Authors: Nima S. Hejazi, Kara E. Rudolph, Mark J. van der Laan, Iván Díaz

Abstract: Causal mediation analysis has historically been limited in two important ways: (i) a focus has traditionally been placed on binary treatments and static interventions, and (ii) direct and indirect effect decompositions have been pursued that are only identifiable in the absence of intermediate confounders affected by treatment. We present a theoretical study of an (in)direct effect decomposition of the population intervention effect, defined by stochastic interventions jointly applied to the treatment and mediators. In contrast to existing proposals, our causal effects can be evaluated regardless of whether a treatment is categorical or continuous and remain well-defined even in the presence of intermediate confounders affected by treatment. Our (in)direct effects are identifiable without a restrictive assumption on cross-world counterfactual independencies, allowing for substantive conclusions drawn from them to be validated in randomized controlled trials. Beyond the novel effects introduced, we provide a careful study of nonparametric efficiency theory relevant for the construction of flexible, multiply robust estimators of our (in)direct effects, while avoiding undue restrictions induced by assuming parametric models of nuisance parameter functionals. To complement our nonparametric estimation strategy, we introduce inferential techniques for constructing confidence intervals and hypothesis tests, and discuss open source software implementing the proposed methodology.

Submitted 11 January, 2022; v1 submitted 14 September, 2020; originally announced September 2020.

Journal ref: Biostatistics, 2022

arXiv:2006.08675 [pdf, ps, other]

Targeted Maximum Likelihood Estimation of Community-based Causal Effect of Community-Level Stochastic Interventions

Authors: Chi Zhang, Jennifer Ahern, Mark J. van der Laan

Abstract: Unlike commonly used parametric regression models such as mixed models, which can easily violate the required statistical assumptions and result in invalid statistical inference, targeted maximum likelihood estimation allows more realistic data-generative models and provides double-robust, semi-parametric and efficient estimators. Targeted maximum likelihood estimators (TMLEs) for the causal effect of a community-level static exposure were previously proposed by Balzer et al. In this manuscript, we build on this work and present identifiability results and develop two semi-parametric efficient TMLEs for the estimation of the causal effect of the single time-point community-level stochastic intervention whose assignment mechanism can depend on measured and unmeasured environmental factors and its individual-level covariates. The first community-level TMLE is developed under a general hierarchical non-parametric structural equation model, which can incorporate pooled individual-level regressions for estimating the outcome mechanism. The second individual-level TMLE is developed under a restricted hierarchical model in which the additional assumption of no covariate interference within communities holds. The proposed TMLEs have several crucial advantages. First, both TMLEs can make use of individual level data in the hierarchical setting, and potentially reduce finite sample bias and improve estimator efficiency. Second, the stochastic intervention framework provides a natural way for defining and estimating causal effects where the exposure variables are continuous or discrete with multiple levels, or even cannot be directly intervened on. Also, the positivity assumption needed for our proposed causal parameters can be weaker than the version of positivity required for other causal parameters.

Submitted 15 June, 2020; originally announced June 2020.

Comments: 20 pages. arXiv admin note: substantial text overlap with arXiv:2006.08553

arXiv:2006.08553 [pdf, other]

tmleCommunity: A R Package Implementing Target Maximum Likelihood Estimation for Community-level Data

Authors: Chi Zhang, Jennifer Ahern, Mark J. van der Laan, Oleg Sofrygin

Abstract: Over the past years, many applications aim to assess the causal effect of treatments assigned at the community level, while data are still collected at the individual level among individuals of the community. In many cases, one wants to evaluate the effect of a stochastic intervention on the community, where all communities in the target population receive probabilistically assigned treatments based on a known specified mechanism (e.g., implementing a community-level intervention policy that targets stochastic changes in the behavior of a target population of communities). The tmleCommunity package was recently developed to implement targeted minimum loss-based estimation (TMLE) of the effect of community-level intervention(s) at a single time point on an individual-based outcome of interest, including the average causal effect. Implementations of the inverse-probability-of-treatment-weighting (IPTW) and the G-computation formula (GCOMP) are also available. The package supports multivariate arbitrary (i.e., static, dynamic or stochastic) interventions with a binary or continuous outcome. In addition, it allows user-specified data-adaptive machine learning algorithms through the SuperLearner, sl3 and h2oEnsemble packages. The usage of the tmleCommunity package, along with a few examples, is described in this paper.

Submitted 15 June, 2020; originally announced June 2020.

Comments: 42 pages

arXiv:2006.07333 [pdf]

Targeted Learning: Robust Statistics for Reproducible Research

Authors: Jeremy R. Coyle, Nima S. Hejazi, Ivana Malenica, Rachael V. Phillips, Benjamin F. Arnold, Andrew Mertens, Jade Benjamin-Chung, Weixin Cai, Sonali Dayal, John M. Colford Jr., Alan E. Hubbard, Mark J. van der Laan

Abstract: Targeted Learning is a subfield of statistics that unifies advances in causal inference, machine learning and statistical theory to help answer scientifically impactful questions with statistical confidence. Targeted Learning is driven by complex problems in data science and has been implemented in a diversity of real-world scenarios: observational studies with missing treatments and outcomes, personalized interventions, longitudinal settings with time-varying treatment regimes, survival analysis, adaptive randomized trials, mediation analysis, and networks of connected subjects. In contrast to the (mis)application of restrictive modeling strategies that dominate the current practice of statistics, Targeted Learning establishes a principled standard for statistical estimation and inference (i.e., confidence intervals and p-values). This multiply robust approach is accompanied by a guiding roadmap and a burgeoning software ecosystem, both of which provide guidance on the construction of estimators optimized to best answer the motivating question. The roadmap of Targeted Learning emphasizes tailoring statistical procedures so as to minimize their assumptions, carefully grounding them only in the scientific knowledge available. The end result is a framework that honestly reflects the uncertainty in both the background knowledge and the available data in order to draw reliable conclusions from statistical analyses - ultimately enhancing the reproducibility and rigor of scientific findings.

Submitted 12 June, 2020; originally announced June 2020.

Comments: 25 pages, 3 figures

MSC Class: 62A01 ACM Class: G.3

arXiv:2006.03632 [pdf, other]

Rate-adaptive model selection over a collection of black-box contextual bandit algorithms

Authors: Aurélien F. Bibaut, Antoine Chambaz, Mark J. van der Laan

Abstract: We consider the model selection task in the stochastic contextual bandit setting. Suppose we are given a collection of base contextual bandit algorithms. We provide a master algorithm that combines them and achieves the same performance, up to constants, as the best base algorithm would, if it had been run on its own. Our approach only requires that each algorithm satisfy a high probability regret bound. Our procedure is very simple and essentially does the following: for a well chosen sequence of probabilities $(p_{t})_{t\geq 1}$, at each round $t$, it either chooses at random which candidate to follow (with probability $p_{t}$) or compares, at the same internal sample size for each candidate, the cumulative reward of each, and selects the one that wins the comparison (with probability $1-p_{t}$). To the best of our knowledge, our proposal is the first one to be rate-adaptive for a collection of general black-box contextual bandit algorithms: it achieves the same regret rate as the best candidate. We demonstrate the effectiveness of our method with simulation studies.

Submitted 5 June, 2020; originally announced June 2020.

arXiv:2005.11303 [pdf, other]

Nonparametric inverse probability weighted estimators based on the highly adaptive lasso

Authors: Ashkan Ertefaie, Nima S. Hejazi, Mark J. van der Laan

Abstract: Inverse probability weighted estimators are the oldest and potentially most commonly used class of procedures for the estimation of causal effects. By adjusting for selection biases via a weighting mechanism, these procedures estimate an effect of interest by constructing a pseudo-population in which selection biases are eliminated. Despite their ease of use, these estimators require the correct specification of a model for the weighting mechanism, are known to be inefficient, and suffer from the curse of dimensionality. We propose a class of nonparametric inverse probability weighted estimators in which the weighting mechanism is estimated via undersmoothing of the highly adaptive lasso, a nonparametric regression function proven to converge at $n^{-1/3}$-rate to the true weighting mechanism. We demonstrate that our estimators are asymptotically linear with variance converging to the nonparametric efficiency bound. Unlike doubly robust estimators, our procedures require neither derivation of the efficient influence function nor specification of the conditional outcome model. Our theoretical developments have broad implications for the construction of efficient inverse probability weighted estimators in large statistical models and a variety of problem settings. We assess the practical performance of our estimators in simulation studies and demonstrate use of our proposed methodology with data from a large-scale epidemiologic study.

Submitted 3 July, 2021; v1 submitted 22 May, 2020; originally announced May 2020.

arXiv:2003.13771 [pdf, other]

doi 10.1111/biom.13375

Efficient nonparametric inference on the effects of stochastic interventions under two-phase sampling, with applications to vaccine efficacy trials

Authors: Nima S. Hejazi, Mark J. van der Laan, Holly E. Janes, Peter B. Gilbert, David C. Benkeser

Abstract: The advent and subsequent widespread availability of preventive vaccines has altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high-burden diseases, including HIV, have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant infection or disease endpoints. However, measuring immune response marker activity is often costly, which has motivated the usage of two-phase sampling for immune response evaluation in clinical trials of preventive vaccines. In such trials, the measurement of immunological markers is performed on a subset of trial participants, where enrollment in this second phase is potentially contingent on the observed study outcome and other participant-level information. We propose nonparametric methodology for efficiently estimating a counterfactual parameter that quantifies the impact of a given immune response marker on the subsequent probability of infection. Along the way, we fill in theoretical gaps pertaining to the asymptotic behavior of nonparametric efficient estimators in the context of two-phase sampling, including a multiple robustness property enjoyed by our estimators. Techniques for constructing confidence intervals and hypothesis tests are presented, and an open source software implementation of the methodology, the txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial.

Submitted 3 April, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

Journal ref: Biometrics, 2020

arXiv:2003.02873 [pdf, other]

Generalized Policy Elimination: an efficient algorithm for Nonparametric Contextual Bandits

Authors: Aurélien F. Bibaut, Antoine Chambaz, Mark J. van der Laan

Abstract: We propose the Generalized Policy Elimination (GPE) algorithm, an oracle-efficient contextual bandit (CB) algorithm inspired by the Policy Elimination algorithm of Dudík et al. (2011). We prove the first regret optimality guarantee theorem for an oracle-efficient CB algorithm competing against a nonparametric class with infinite VC-dimension. Specifically, we show that GPE is regret-optimal (up to logarithmic factors) for policy classes with integrable entropy. For classes with larger entropy, we show that the core techniques used to analyze GPE can be used to design an $\varepsilon$-greedy algorithm with regret bound matching that of the best algorithms to date. We illustrate the applicability of our algorithms and theorems with examples of large nonparametric policy classes, for which the relevant optimization oracles can be efficiently implemented.

Submitted 5 March, 2020; originally announced March 2020.

arXiv:1912.09936 [pdf, other]

doi 10.1093/biomet/asaa085

Non-parametric efficient causal mediation with intermediate confounders

Authors: Iván Díaz, Nima S. Hejazi, Kara E. Rudolph, Mark J. van der Laan

Abstract: Interventional effects for mediation analysis were proposed as a solution to the lack of identifiability of natural (in)direct effects in the presence of a mediator-outcome confounder affected by exposure. We present a theoretical and computational study of the properties of the interventional (in)direct effect estimands based on the efficient influence function (EIF) in the non-parametric statistical model. We use the EIF to develop two asymptotically optimal, non-parametric estimators that leverage data-adaptive regression for estimation of the nuisance parameters: a one-step estimator and a targeted minimum loss estimator. A free and open source R package implementing our proposed estimators is made available on GitHub. We further present results establishing the conditions under which these estimators are consistent, multiply robust, $n^{1/2}$-consistent and efficient. We illustrate the finite-sample performance of the estimators and corroborate our theoretical results in a simulation study. We also demonstrate the use of the estimators in our motivating application to elucidate the mechanisms behind the unintended harmful effects that a housing intervention had on adolescent girls' risk behavior.

Submitted 29 May, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

Journal ref: Biometrika, 2020

arXiv:1912.06675 [pdf, other]

Conditional Super Learner

Authors: Gilmer Valdes, Yannet Interian, Efstathios D. Gennatas, Mark J. van der Laan

Abstract: In this article we consider the Conditional Super Learner (CSL), an algorithm which selects the best model candidate from a library conditional on the covariates. The CSL expands the idea of using cross-validation to select the best model and merges it with meta learning. Here we propose a specific algorithm that finds a local minimum to the problem posed, prove that it converges at a rate faster than $O_p(n^{-1/4})$, and offer extensive empirical evidence that it is an excellent candidate to replace stacking or to analyze hierarchical problems.

Submitted 13 December, 2019; originally announced December 2019.

arXiv:1912.06292 [pdf, other]

More Efficient Off-Policy Evaluation through Regularized Targeted Learning

Authors: Aurélien F. Bibaut, Ivana Malenica, Nikos Vlassis, Mark J. van der Laan

Abstract: We study the problem of off-policy evaluation (OPE) in Reinforcement Learning (RL), where the aim is to estimate the performance of a new policy given historical data that may have been generated by a different policy, or policies. In particular, we introduce a novel doubly-robust estimator for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature. We also introduce several variance reduction techniques that lead to impressive performance gains in off-policy evaluation. We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. Finally, we further the existing theoretical analysis of estimators for the RL off-policy estimation problem by showing their $O_P(1/\sqrt{n})$ rate of convergence and characterizing their asymptotic distribution.

Submitted 12 December, 2019; originally announced December 2019.

Comments: We are uploading the full paper with the appendix as of 12/12/2019, as we noticed that, unlike the main text, the appendix has not been made available on PMLR's website. The version of the appendix in this document is the same that we have been sending by email since June 2019 to readers who solicited it

Journal ref: Proceedings of the 36th International Conference on Machine Learning, PMLR 97:654-663, 2019

arXiv:1908.05607 [pdf, other]

Efficient Estimation of Pathwise Differentiable Target Parameters with the Undersmoothed Highly Adaptive Lasso

Authors: Mark J. van der Laan, David Benkeser, Weixin Cai

Abstract: We consider estimation of a functional parameter of a realistically modeled data distribution based on observing independent and identically distributed observations. We define an $m$-th order Spline Highly Adaptive Lasso Minimum Loss Estimator (Spline HAL-MLE) of a functional parameter that is defined by minimizing the empirical risk function over an $m$-th order smoothness class of functions. We show that this $m$-th order smoothness class consists of all functions that can be represented as an infinitesimal linear combination of tensor products of $\leq m$-th order spline-basis functions, and involves assuming $m$-derivatives in each coordinate. By selecting $m$ with cross-validation we obtain a Spline-HAL-MLE that is able to adapt to the underlying unknown smoothness of the true function, while guaranteeing a rate of convergence faster than $n^{-1/4}$, as long as the true function is cadlag (right-continuous with left-hand limits) and has finite sectional variation norm. The $m=0$-smoothness class consists of all cadlag functions with finite sectional variation norm and corresponds with the original HAL-MLE defined in van der Laan (2015). In this article we establish that this Spline-HAL-MLE yields an asymptotically efficient estimator of any smooth feature of the functional parameter under an easily verifiable global undersmoothing condition. A sufficient condition for the latter condition is that the minimum of the empirical mean of the selected basis functions is smaller than a constant times $n^{-1/2}$, which is not parameter specific and enforces the selection of the $L_1$-norm in the lasso to be large enough to include sparsely supported basis functions. We demonstrate our general result for the $m=0$-HAL-MLE of the average treatment effect and of the integral of the square of the data density. We also present simulations for these two examples confirming the theory.

Submitted 2 July, 2021; v1 submitted 14 August, 2019; originally announced August 2019.

arXiv:1905.13414 [pdf, other]

Targeted Estimation of L2 Distance Between Densities and its Application to Geo-spatial Data

Authors: George Shan, Mark J. van der Laan

Abstract: We examine the integrated squared difference, also known as the L2 distance (L2D), between two probability densities. Such a distance metric allows for comparison of differences between pairs of distributions or changes in a distribution over time. We propose a targeted maximum likelihood estimator for this parameter based on samples of independent and identically distributed observations from both underlying distributions. We compare our method to kernel density estimation and demonstrate superior performance for our method with regard to confidence interval coverage rate and mean squared error.

Submitted 31 May, 2019; originally announced May 2019.

Comments: 17 pages, 3 figures, 2 appendices included

arXiv:1903.09731 [pdf]

doi 10.1073/pnas.1906831117

Expert-Augmented Machine Learning

Authors: E. D. Gennatas, J. H. Friedman, L. H. Ungar, R. Pirracchio, E. Eaton, L. Reichman, Y. Interian, C. B. Simone, A. Auerbach, E. Delgado, M. J. Van der Laan, T. D. Solberg, G. Valdes

Abstract: Machine Learning is proving invaluable across disciplines. However, its success is often limited by the quality and quantity of available data, while its adoption is limited by the level of trust that models afford users. Human vs. machine performance is commonly compared empirically to decide whether a certain task should be performed by a computer or an expert. In reality, the optimal learning strategy may involve combining the complementary strengths of man and machine. Here we present Expert-Augmented Machine Learning (EAML), an automated method that guides the extraction of expert knowledge and its integration into machine-learned models. We use a large dataset of intensive care patient data to predict mortality and show that we can extract expert knowledge using an online platform, help reveal hidden confounders, improve generalizability on a different population and learn using less data. EAML presents a novel framework for high performance and dependable machine learning in critical applications.

Submitted 5 January, 2021; v1 submitted 22 March, 2019; originally announced March 2019.

arXiv:1903.03690 [pdf, ps, other]

doi 10.1111/biom.13274

Transporting stochastic direct and indirect effects to new populations

Authors: Kara E Rudolph, Jonathan Levy, Mark J van der Laan

Abstract: Transported mediation effects may contribute to understanding how and why interventions may work differently when applied to new populations. However, we are not aware of any estimators for such effects. Thus, we propose several different estimators of transported stochastic direct and indirect effects: an inverse-probability of treatment stabilized weighted estimator, a doubly robust estimator that solves the estimating equation, and a doubly robust substitution estimator in the targeted minimum loss-based framework. We demonstrate their finite sample properties in a simulation study.

Submitted 8 March, 2019; originally announced March 2019.

Journal ref: Biometrics. 2020

arXiv:1901.05056 [pdf, other]

A nonparametric super-efficient estimator of the average treatment effect

Authors: David Benkeser, Weixin Cai, Mark J van der Laan

Abstract: Doubly robust estimators of causal effects are a popular means of estimating causal effects. Such estimators combine an estimate of the conditional mean of the outcome given treatment and confounders (the so-called outcome regression) with an estimate of the conditional probability of treatment given confounders (the propensity score) to generate an estimate of the effect of interest. In addition to enjoying the double-robustness property, these estimators have additional benefits. First, flexible regression tools, such as those developed in the field of machine learning, can be utilized to estimate the relevant regressions, while the estimators of the treatment effects retain desirable statistical properties. Furthermore, these estimators are often statistically efficient, achieving the lower bound on the variance of regular, asymptotically linear estimators. However, in spite of their asymptotic optimality, in problems where causal estimands are weakly identifiable, these estimators may behave erratically. We propose two new estimation techniques for use in these challenging settings. Our estimators build on two existing frameworks for efficient estimation: targeted minimum loss estimation and one-step estimation. However, rather than using an estimate of the propensity score in their construction, we instead opt for an alternative regression quantity when building our estimators: the conditional probability of treatment given the conditional mean outcome. We discuss the theoretical implications and demonstrate the estimators' performance in simulated and real data.

Submitted 15 January, 2019; originally announced January 2019.

arXiv:1810.12452 [pdf, other]

doi 10.1080/01621459.2019.1704292

Complier stochastic direct effects: identification and robust estimation

Authors: Kara E Rudolph, Oleg Sofrygin, Mark J van der Laan

Abstract: Mediation analysis is critical to understanding the mechanisms underlying exposure-outcome relationships. In this paper, we identify the instrumental variable (IV)-direct effect of the exposure on the outcome not through the mediator, using randomization of the instrument. To our knowledge, such an estimand has not previously been considered or estimated. We propose and evaluate several estimators for this estimand: a ratio of inverse-probability of treatment-weighted estimators (IPTW), a ratio of estimating equation estimators (EE), a ratio of targeted minimum loss-based estimators (TMLE), and a TMLE that targets the complier stochastic direct effect (CSDE) directly. These estimators are applicable for a variety of study designs, including randomized encouragement trials, like the MTO housing voucher experiment we consider as an illustrative example, treatment discontinuities, and Mendelian randomization. We found the IPTW estimator to be the most sensitive to finite sample bias, resulting in bias of over 40% even when all models were correctly specified in a sample size of N=100. In contrast, the EE estimator and compatible TMLE estimator were far less sensitive to finite samples. The EE and TMLE estimators also have advantages over the IPTW estimator in terms of efficiency and reduced reliance on correct parametric model specification.

Submitted 29 October, 2018; originally announced October 2018.

Journal ref: Journal of the American Statistical Association. 2020

arXiv:1810.03030 [pdf, other]

Robust variance estimation and inference for causal effect estimation

Authors: Linh Tran, Maya Petersen, Joshua Schwab, Mark J van der Laan

Abstract: We consider a longitudinal data structure consisting of baseline covariates, time-varying treatment variables, intermediate time-dependent covariates, and a possibly time dependent outcome. Previous studies have shown that estimating the variance of asymptotically linear estimators using empirical influence functions in this setting results in anti-conservative estimates with increasing magnitudes of positivity violations, leading to poor coverage and uncontrolled Type I errors. In this paper, we present two alternative approaches to estimating the variance of these estimators: (i) a robust approach which directly targets the variance of the influence function as a counterfactual mean outcome, and (ii) a non-parametric bootstrap based approach that is theoretically valid and lowers the computational cost, thereby increasing the feasibility in non-parametric settings using complex machine learning algorithms. The performance of these approaches is compared to that of the empirical influence function in simulations across different levels of positivity violations and treatment effect sizes.

Submitted 6 October, 2018; originally announced October 2018.

Comments: 20 pages, 8 figures

arXiv:1809.00734 [pdf, other]

Robust Estimation of Data-Dependent Causal Effects based on Observing a Single Time-Series

Authors: Mark J. van der Laan, Ivana Malenica

Abstract: Consider the case in which one observes a single time-series, where at each time t one observes a data record O(t) involving treatment nodes A(t), possible covariates L(t) and an outcome node Y(t). The data record at time t carries information for a (potentially causal) effect of the treatment A(t) on the outcome Y(t), in the context defined by a fixed dimensional summary measure Co(t). We are concerned with defining causal effects that can be consistently estimated, with valid inference, for sequentially randomized experiments without further assumptions. More generally, we consider the case when the (possibly causal) effects can be estimated in a double robust manner, analogous to double robust estimation of effects in the i.i.d. causal inference literature. We propose a general class of averages of conditional (context-specific) causal parameters that can be estimated in a double robust manner, therefore fully utilizing the sequential randomization. We propose a targeted maximum likelihood estimator (TMLE) of these causal parameters, and present a general theorem establishing the asymptotic consistency and normality of the TMLE. We extend our general framework to a number of typically studied causal target parameters, including a sequentially adaptive design within a single unit that learns the optimal treatment rule for the unit over time. Our work opens up robust statistical inference for causal questions based on observing a single time-series on a particular unit.

Submitted 3 September, 2018; originally announced September 2018.

arXiv:1808.03231 [pdf, other]

Statistical Analysis Plan for SEARCH Phase I: Health Outcomes among Adults

Authors: Laura B. Balzer, Diane V. Havlir, Joshua Schwab, Mark J. Van Der Laan, Maya L. Petersen

Abstract: This document provides the analytic plan for evaluating adult HIV incidence, health, and implementation outcomes for the first phase of the SEARCH Study. Locked: November 27, 2017. Embargoed until July 25, 2018.

Submitted 25 July, 2018; originally announced August 2018.

Comments: 40 pgs

arXiv:1806.06784 [pdf, other]

Robust inference on the average treatment effect using the outcome highly adaptive lasso

Authors: Cheng Ju, David Benkeser, Mark J. van der Laan

Abstract: Many estimators of the average effect of a treatment on an outcome require estimation of the propensity score, the outcome regression, or both. It is often beneficial to utilize flexible techniques such as semiparametric regression or machine learning to estimate these quantities. However, optimal estimation of these regressions does not necessarily lead to optimal estimation of the average treatment effect, particularly in settings with strong instrumental variables. A recent proposal addressed these issues via the outcome-adaptive lasso, a penalized regression technique for estimating the propensity score that seeks to minimize the impact of instrumental variables on treatment effect estimators. However, a notable limitation of this approach is that its application is restricted to parametric models. We propose a more flexible alternative that we call the outcome highly adaptive lasso. We discuss large sample theory for this estimator and propose closed form confidence intervals based on the proposed estimator. We show via simulation that our method offers benefits over several popular approaches.

Submitted 12 May, 2019; v1 submitted 18 June, 2018; originally announced June 2018.

Comments: The first two authors contributed equally to this work

arXiv:1804.00102 [pdf, other]

Collaborative targeted inference from continuously indexed nuisance parameter estimators

Authors: Cheng Ju, Antoine Chambaz, Mark J. van der Laan

Abstract: We wish to infer the value of a parameter at a law from which we sample independent observations. The parameter is smooth and we can define two variation-independent features of the law, its $Q$- and $G$-components, such that estimating them consistently at a fast enough product of rates allows to build a confidence interval (CI) with a given asymptotic level from a plain targeted minimum loss estimator (TMLE). Say that the above product is not fast enough and the algorithm for the $G$-component is fine-tuned by a real-valued $h$. A plain TMLE with an $h$ chosen by cross-validation would typically not yield a CI. We construct a collaborative TMLE (C-TMLE) and show under mild conditions that, if there exists an oracle $h$ that makes a bulky remainder term asymptotically Gaussian, then the C-TMLE yields a CI. We illustrate our findings with the inference of the average treatment effect. We conduct a simulation study where the $G$-component is estimated by the LASSO and $h$ is the bound on the coefficients' norms. It sheds light on small sample properties, in the face of low- to high-dimensional baseline covariates, and possibly positivity violation.

Submitted 5 April, 2018; v1 submitted 30 March, 2018; originally announced April 2018.

Comments: 38 pages

arXiv:1802.09642 [pdf]

Selecting optimal subgroups for treatment using many covariates

Authors: Tyler J. VanderWeele, Alex R. Luedtke, Mark J. van der Laan, Ronald C. Kessler

Abstract: We consider the problem of selecting the optimal subgroup to treat when data on covariates is available from a randomized trial or observational study. We distinguish between four different settings including (i) treatment selection when resources are constrained, (ii) treatment selection when resources are not constrained, (iii) treatment selection in the presence of side effects and costs, and (iv) treatment selection to maximize effect heterogeneity. We show that, in each of these cases, the optimal treatment selection rule involves treating those for whom the predicted mean difference in outcomes comparing those with versus without treatment, conditional on covariates, exceeds a certain threshold. The threshold varies across these four scenarios but the form of the optimal treatment selection rule does not. The results suggest a move away from traditional subgroup analysis for personalized medicine. New randomized trial designs are proposed so as to implement and make use of optimal treatment selection rules in health care practice.

Submitted 26 February, 2018; originally announced February 2018.

Showing 1–50 of 74 results for author: Van Der Laan, M J