-
Debiased machine learning for counterfactual survival functionals based on left-truncated right-censored data
Authors:
Eric R. Morenz,
Charles J. Wolock,
Marco Carone
Abstract:
Learning causal effects of a binary exposure on time-to-event endpoints can be challenging because survival times may be partially observed due to censoring and systematically biased due to truncation. In this work, we present debiased machine learning-based nonparametric estimators of the joint distribution of a counterfactual survival time and baseline covariates for use when the observed data are subject to covariate-dependent left truncation and right censoring and when baseline covariates suffice to deconfound the relationship between exposure and survival time. Our inferential procedures explicitly allow the integration of flexible machine learning tools for nuisance estimation, and enjoy certain robustness properties. The approach we propose can be directly used to make pointwise or uniform inference on smooth summaries of the joint counterfactual survival time and covariate distribution, and can be valuable even in the absence of interventions, when summaries of a marginal survival distribution are of interest. We showcase how our procedures can be used to learn a variety of inferential targets and illustrate their performance in simulation studies.
Submitted 13 November, 2024;
originally announced November 2024.
-
Stabilized Inverse Probability Weighting via Isotonic Calibration
Authors:
Lars van der Laan,
Ziming Lin,
Marco Carone,
Alex Luedtke
Abstract:
Inverse weighting with an estimated propensity score is widely used by estimation methods in causal inference to adjust for confounding bias. However, directly inverting propensity score estimates can lead to instability, bias, and excessive variability due to large inverse weights, especially when treatment overlap is limited. In this work, we propose a post-hoc calibration algorithm for inverse propensity weights that generates well-calibrated, stabilized weights from user-supplied, cross-fitted propensity score estimates. Our approach employs a variant of isotonic regression with a loss function specifically tailored to the inverse propensity weights. Through theoretical analysis and empirical studies, we demonstrate that isotonic calibration improves the performance of doubly robust estimators of the average treatment effect.
Submitted 9 April, 2025; v1 submitted 9 November, 2024;
originally announced November 2024.
-
Automatic doubly robust inference for linear functionals via calibrated debiased machine learning
Authors:
Lars van der Laan,
Alex Luedtke,
Marco Carone
Abstract:
In causal inference, many estimands of interest can be expressed as a linear functional of the outcome regression function; this includes, for example, average causal effects of static, dynamic and stochastic interventions. For learning such estimands, in this work, we propose novel debiased machine learning estimators that are doubly robust asymptotically linear, thus providing not only doubly robust consistency but also facilitating doubly robust inference (e.g., confidence intervals and hypothesis tests). To do so, we first establish a key link between calibration, a machine learning technique typically used in prediction and classification tasks, and the conditions needed to achieve doubly robust asymptotic linearity. We then introduce calibrated debiased machine learning (C-DML), a unified framework for doubly robust inference, and propose a specific C-DML estimator that integrates cross-fitting, isotonic calibration, and debiased machine learning estimation. A C-DML estimator maintains asymptotic linearity when either the outcome regression or the Riesz representer of the linear functional is estimated sufficiently well, allowing the other to be estimated at arbitrarily slow rates or even inconsistently. We propose a simple bootstrap-assisted approach for constructing doubly robust confidence intervals. Our theoretical and empirical results support the use of C-DML to mitigate bias arising from the inconsistent or slow estimation of nuisance functions.
Submitted 4 November, 2024;
originally announced November 2024.
-
Propensity Score Augmentation in Matching-based Estimation of Causal Effects
Authors:
Ernesto Ulloa-Pérez,
Marco Carone,
Alex Luedtke
Abstract:
When assessing the causal effect of a binary exposure using observational data, confounder imbalance across exposure arms must be addressed. Matching methods, including propensity score-based matching, can be used to deconfound the causal relationship of interest. They have been particularly popular in practice, at least in part due to their simplicity and interpretability. However, these methods can suffer from low statistical efficiency compared to many competing methods. In this work, we propose a novel matching-based estimator of the average treatment effect based on a suitably-augmented propensity score model. Our procedure can be shown to have greater statistical efficiency than traditional matching estimators whenever prognostic variables are available, and in some cases, can nearly reach the nonparametric efficiency bound. In addition to a theoretical study, we provide numerical results to illustrate our findings. Finally, we use our novel procedure to estimate the effect of circumcision on risk of HIV-1 infection using vaccine efficacy trial data.
Submitted 28 September, 2024;
originally announced September 2024.
-
Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data
Authors:
Ellen Graham,
Marco Carone,
Andrea Rotnitzky
Abstract:
We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this paper, we extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, we provide universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation.
Submitted 24 February, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Investigating symptom duration using current status data: a case study of post-acute COVID-19 syndrome
Authors:
Charles J. Wolock,
Susan Jacob,
Julia C. Bennett,
Anna Elias-Warren,
Jessica O'Hanlon,
Avi Kenny,
Nicholas P. Jewell,
Andrea Rotnitzky,
Stephen R. Cole,
Ana A. Weil,
Helen Y. Chu,
Marco Carone
Abstract:
For infectious diseases, characterizing symptom duration is of clinical and public health importance. Symptom duration may be assessed by surveying infected individuals and querying symptom status at the time of survey response. For example, in a SARS-CoV-2 testing program at the University of Washington, participants were surveyed at least $28$ days after testing positive and asked to report current symptom status. This study design yielded current status data: outcome measurements for each respondent consisted only of the time of survey response and a binary indicator of whether symptoms had resolved by that time. Such study design benefits from limited risk of recall bias, but analyzing the resulting data necessitates tailored statistical tools. Here, we review methods for current status data and describe a novel application of modern nonparametric techniques to this setting. The proposed approach is valid under weaker assumptions compared to existing methods, allows use of flexible machine learning tools, and handles potential survey nonresponse. From the university study, under an assumption that the survey response time is conditionally independent of symptom resolution time within strata of measured covariates, we estimate that 19% of participants experienced ongoing symptoms 30 days after testing positive, decreasing to 7% at 90 days. We assess the sensitivity of these results to deviations from conditional independence, finding the estimates to be more sensitive to assumption violations at 30 days compared to 90 days. Female sex, fatigue during acute infection, and higher viral load were associated with slower symptom resolution.
Submitted 17 March, 2025; v1 submitted 4 July, 2024;
originally announced July 2024.
-
Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts
Authors:
Lars van der Laan,
Marco Carone,
Alex Luedtke
Abstract:
We introduce efficient plug-in (EP) learning, a novel framework for the estimation of heterogeneous causal contrasts, such as the conditional average treatment effect and conditional relative risk. The EP-learning framework enjoys the same oracle-efficiency as Neyman-orthogonal learning strategies, such as DR-learning and R-learning, while addressing some of their primary drawbacks, including that (i) their practical applicability can be hindered by loss function non-convexity; and (ii) they may suffer from poor performance and instability due to inverse probability weighting and pseudo-outcomes that violate bounds. To avoid these drawbacks, EP-learner constructs an efficient plug-in estimator of the population risk function for the causal contrast, thereby inheriting the stability and robustness properties of plug-in estimation strategies like T-learning. Under reasonable conditions, EP-learners based on empirical risk minimization are oracle-efficient, exhibiting asymptotic equivalence to the minimizer of an oracle-efficient one-step debiased estimator of the population risk function. In simulation experiments, we illustrate that EP-learners of the conditional average treatment effect and conditional relative risk outperform state-of-the-art competitors, including T-learner, R-learner, and DR-learner. Open-source implementations of the proposed methods are available in our R package hte3.
Submitted 2 February, 2024;
originally announced February 2024.
-
Assessing variable importance in survival analysis using machine learning
Authors:
Charles J. Wolock,
Peter B. Gilbert,
Noah Simon,
Marco Carone
Abstract:
Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. For example, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of HIV acquisition over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to HIV acquisition are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 vaccine trial to inform enrollment strategies for future HIV vaccine trials.
Submitted 12 August, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Adaptive debiased machine learning using data-driven model selection techniques
Authors:
Lars van der Laan,
Marco Carone,
Alex Luedtke,
Mark van der Laan
Abstract:
Debiased machine learning estimators for nonparametric inference of smooth functionals of the data-generating distribution can suffer from excessive variability and instability. For this reason, practitioners may resort to simpler models based on parametric or semiparametric assumptions. However, such simplifying assumptions may fail to hold, and estimates may then be biased due to model misspecification. To address this problem, we propose Adaptive Debiased Machine Learning (ADML), a nonparametric framework that combines data-driven model selection and debiased machine learning techniques to construct asymptotically linear, adaptive, and superefficient estimators for pathwise differentiable functionals. By learning model structure directly from data, ADML avoids the bias introduced by model misspecification and remains free from the restrictions of parametric and semiparametric models. While ADML estimators may exhibit irregular behavior for the target parameter in a nonparametric statistical model, we demonstrate that they provide regular and locally uniformly valid inference for a projection-based oracle parameter. Importantly, this oracle parameter agrees with the original target parameter for distributions within an unknown but correctly specified oracle statistical submodel that is learned from the data. This finding implies that there is no penalty, in a local asymptotic sense, for conducting data-driven model selection compared to having prior knowledge of the oracle submodel and oracle parameter. To demonstrate the practical applicability of our theory, we provide a broad class of ADML estimators for estimating the average treatment effect in adaptive partially linear regression models.
Submitted 24 July, 2023;
originally announced July 2023.
-
Causal isotonic calibration for heterogeneous treatment effects
Authors:
Lars van der Laan,
Ernesto Ulloa-Pérez,
Marco Carone,
Alex Luedtke
Abstract:
We propose causal isotonic calibration, a novel nonparametric method for calibrating predictors of heterogeneous treatment effects. Furthermore, we introduce cross-calibration, a data-efficient variant of calibration that eliminates the need for hold-out calibration sets. Cross-calibration leverages cross-fitted predictors and generates a single calibrated predictor using all available data. Under weak conditions that do not assume monotonicity, we establish that both causal isotonic calibration and cross-calibration achieve fast doubly-robust calibration rates, as long as either the propensity score or outcome regression is estimated accurately in a suitable sense. The proposed causal isotonic calibrator can be wrapped around any black-box learning algorithm, providing robust and distribution-free calibration guarantees while preserving predictive performance.
Submitted 5 June, 2023; v1 submitted 27 February, 2023;
originally announced February 2023.
-
A framework for leveraging machine learning tools to estimate personalized survival curves
Authors:
Charles J. Wolock,
Peter B. Gilbert,
Noah Simon,
Marco Carone
Abstract:
The conditional survival function of a time-to-event outcome subject to censoring and truncation is a common target of estimation in survival analysis. This parameter may be of scientific interest and also often appears as a nuisance in nonparametric and semiparametric problems. In addition to classical parametric and semiparametric methods (e.g., based on the Cox proportional hazards model), flexible machine learning approaches have been developed to estimate the conditional survival function. However, many of these methods are either implicitly or explicitly targeted toward risk stratification rather than overall survival function estimation. Others apply only to discrete-time settings or require inverse probability of censoring weights, which can be as difficult to estimate as the outcome survival function itself. Here, we employ a decomposition of the conditional survival function in terms of observable regression models in which censoring and truncation play no role. This allows application of an array of flexible regression and classification methods rather than only approaches that explicitly handle the complexities inherent to survival data. We outline estimation procedures based on this decomposition, empirically assess their performance, and demonstrate their use on data from an HIV vaccine trial.
Submitted 31 October, 2023; v1 submitted 6 November, 2022;
originally announced November 2022.
-
Can the potential benefit of individualizing treatment be assessed using trial summary statistics alone?
Authors:
Nina Galanter,
Marco Carone,
Ronald C. Kessler,
Alex Luedtke
Abstract:
Individualizing treatment assignment can improve outcomes for diseases with patient-to-patient variability in comparative treatment effects. When a clinical trial demonstrates that some patients improve on treatment while others do not, it is tempting to assume that treatment effect heterogeneity exists. However, if variability in response is mainly driven by factors other than treatment, investigating the extent to which covariate data can predict differential treatment response is a potential waste of resources. Motivated by recent meta-analyses assessing the potential of individualizing treatment for major depressive disorder using only summary statistics, we provide a method that uses summary statistics widely available in published clinical trial results to bound the benefit of optimally assigning treatment to each patient. We also offer alternate bounds for settings in which trial results are stratified by another covariate. We demonstrate our approach using summary statistics from a depression treatment trial. Our methods are implemented in the rct2otrbounds R package, which is available at https://github.com/ngalanter/rct2otrbounds .
Submitted 31 October, 2022;
originally announced November 2022.
-
A general adaptive framework for multivariate point null testing
Authors:
Adam Elder,
Marco Carone,
Peter Gilbert,
Alex Luedtke
Abstract:
As a common step in refining their scientific inquiry, investigators are often interested in performing some screening of a collection of given statistical hypotheses. For example, they may wish to determine whether any one of several patient characteristics is associated with a health outcome of interest. Existing generic methods for testing a multivariate hypothesis -- such as multiplicity corrections applied to individual hypothesis tests -- can easily be applied across a variety of problems but can suffer from low power in some settings. Tailor-made procedures can attain higher power by building around problem-specific information but typically cannot be easily adapted to novel settings. In this work, we propose a general framework for testing a multivariate point null hypothesis in which the test statistic is adaptively selected to provide increased power. We present theoretical large-sample guarantees for our test under both fixed and local alternatives. In simulation studies, we show that tests created using our framework can perform as well as tailor-made methods when the latter are available, and we illustrate how our procedure can be used to create tests in two settings in which tailor-made methods are not currently available.
Submitted 3 March, 2022;
originally announced March 2022.
-
Individualized treatment rules under stochastic treatment cost constraints
Authors:
Hongxiang Qiu,
Marco Carone,
Alex Luedtke
Abstract:
Estimation and evaluation of individualized treatment rules have been studied extensively, but real-world treatment resource constraints have received limited attention in existing methods. We investigate a setting in which treatment is intervened upon based on covariates to optimize the mean counterfactual outcome under treatment cost constraints when the treatment cost is random. In a particularly interesting special case, an instrumental variable corresponding to encouragement to treatment is intervened upon with constraints on the proportion receiving treatment. For such settings, we first develop a method to estimate optimal individualized treatment rules. We further construct an asymptotically efficient plug-in estimator of the corresponding average treatment effect relative to a given reference rule.
Submitted 22 November, 2022; v1 submitted 17 January, 2022;
originally announced January 2022.
-
Assessment of Immune Correlates of Protection via Controlled Vaccine Efficacy and Controlled Risk
Authors:
Peter B. Gilbert,
Youyi Fong,
Marco Carone
Abstract:
Immune correlates of protection (CoPs) are immunologic biomarkers accepted as a surrogate for an infectious disease clinical endpoint and thus can be used for traditional or provisional vaccine approval. To study CoPs in randomized, placebo-controlled trials, correlates of risk (CoRs) are first assessed in vaccine recipients. This analysis does not assess causation, as a CoR may fail to be a CoP. We propose a causal CoP analysis that estimates the controlled vaccine efficacy curve across biomarker levels $s$, $CVE(s)$, equal to one minus the ratio of the controlled-risk curve $r_C(s)$ at $s$ and placebo risk, where $r_C(s)$ is causal risk if all participants are assigned vaccine and the biomarker is set to $s$. The criterion for a useful CoP is wide variability of $CVE(s)$ in $s$. Moreover, estimation of $r_C(s)$ is of interest in itself, especially in studies without a placebo arm. For estimation of $r_C(s)$, measured confounders can be adjusted for by any regression method that accommodates missing biomarkers, to which we add sensitivity analysis to quantify robustness of CoP evidence to unmeasured confounding. Application to two harmonized phase 3 trials supports that 50% neutralizing antibody titer has value as a controlled vaccine efficacy CoP for virologically confirmed dengue (VCD): in CYD14 the point estimate (95% confidence interval) for $CVE(s)$ accounting for measured confounders and building in conservative margin for unmeasured confounding increases from 29.6% (95% CI 3.5 to 45.9) at titer 1:36 to 78.5% (95% CI 67.9 to 86.8) at titer 1:1200; these estimates are 17.4% (95% CI -14.4 to 36.5) and 84.5% (95% CI 79.6 to 89.1) for CYD15.
Submitted 12 July, 2021;
originally announced July 2021.
-
Inference for treatment-specific survival curves using machine learning
Authors:
Ted Westling,
Alex Luedtke,
Peter Gilbert,
Marco Carone
Abstract:
In the absence of data from a randomized trial, researchers often aim to use observational data to draw causal inference about the effect of a treatment on a time-to-event outcome. In this context, interest often focuses on the treatment-specific survival curves; that is, the survival curves were the entire population under study to be assigned to receive the treatment or not. Under certain causal conditions, including that all confounders of the treatment-outcome relationship are observed, the treatment-specific survival can be identified with a covariate-adjusted survival function. Several estimators of this function have been proposed, including estimators based on outcome regression, inverse probability weighting, and doubly robust estimators. In this article, we propose a new cross-fitted doubly-robust estimator that incorporates data-adaptive (e.g., machine learning) estimators of the conditional survival functions. We establish conditions on the nuisance estimators under which our estimator is consistent and asymptotically linear, both pointwise and uniformly in time. We also propose a novel ensemble learner for combining multiple candidate estimators of the conditional survival functions. Notably, our methods and results accommodate events occurring in discrete or continuous time (or both). We investigate the practical performance of our methods using numerical studies and an application to the effect of a surgical treatment to prevent metastases of parotid carcinoma on mortality.
Submitted 11 June, 2021;
originally announced June 2021.
-
Inference on function-valued parameters using a restricted score test
Authors:
Aaron Hudson,
Marco Carone,
Ali Shojaie
Abstract:
It is often of interest to make inference on an unknown function that is a local parameter of the data-generating mechanism, such as a density or regression function. Such estimands can typically only be estimated at a slower-than-parametric rate in nonparametric and semiparametric models, and performing calibrated inference can be challenging. In many cases, these estimands can be expressed as the minimizer of a population risk functional. Here, we propose a general framework that leverages such representation and provides a nonparametric extension of the score test for inference on an infinite-dimensional risk minimizer. We demonstrate that our framework is applicable in a wide variety of problems. As both analytic and computational examples, we describe how to use our general approach for inference on a mean regression function under (i) nonparametric and (ii) partially additive models, and evaluate the operating characteristics of the resulting procedures via simulations.
Submitted 14 May, 2021;
originally announced May 2021.
-
A general framework for inference on algorithm-agnostic variable importance
Authors:
Brian D. Williamson,
Peter B. Gilbert,
Noah R. Simon,
Marco Carone
Abstract:
In many applications, it is of interest to assess the relative contribution of features (or subsets of features) toward the goal of predicting a response -- in other words, to gauge the variable importance of features. Most recent work on variable importance assessment has focused on describing the importance of features within the confines of a given prediction algorithm. However, such assessment does not necessarily characterize the prediction potential of features, and may provide a misleading reflection of the intrinsic value of these features. To address this limitation, we propose a general framework for nonparametric inference on interpretable algorithm-agnostic variable importance. We define variable importance as a population-level contrast between the oracle predictiveness of all available features versus all features except those under consideration. We propose a nonparametric efficient estimation procedure that allows the construction of valid confidence intervals, even when machine learning techniques are used. We also outline a valid strategy for testing the null importance hypothesis. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection.
Submitted 13 September, 2021; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Universal sieve-based strategies for efficient estimation using machine learning tools
Authors:
Hongxiang Qiu,
Alex Luedtke,
Marco Carone
Abstract:
Suppose that we wish to estimate a finite-dimensional summary of one or more function-valued features of an underlying data-generating mechanism under a nonparametric model. One approach to estimation is by plugging in flexible estimates of these features. Unfortunately, in general, such estimators may not be asymptotically efficient, which often makes these estimators difficult to use as a basis for inference. Though there are several existing methods to construct asymptotically efficient plug-in estimators, each such method either can only be derived using knowledge of efficiency theory or is only valid under stringent smoothness assumptions. Among existing methods, sieve estimators stand out as particularly convenient because efficiency theory is not required in their construction, their tuning parameters can be selected data adaptively, and they are universal in the sense that the same fits lead to efficient plug-in estimators for a rich class of estimands. Inspired by these desirable properties, we propose two novel universal approaches for estimating function-valued features that can be analyzed using sieve estimation theory. Compared to traditional sieve estimators, these approaches are valid under more general conditions on the smoothness of the function-valued features by utilizing flexible estimates that can be obtained, for example, using machine learning.
Submitted 26 August, 2020; v1 submitted 3 March, 2020;
originally announced March 2020.
-
Combining Biomarkers by Maximizing the True Positive Rate for a Fixed False Positive Rate
Authors:
Allison Meisner,
Marco Carone,
Margaret S. Pepe,
Kathleen F. Kerr
Abstract:
Biomarkers abound in many areas of clinical research, and often investigators are interested in combining them for diagnosis, prognosis, or screening. In many applications, the true positive rate for a biomarker combination at a prespecified, clinically acceptable false positive rate is the most relevant measure of predictive capacity. We propose a distribution-free method for constructing biomarker combinations by maximizing the true positive rate while constraining the false positive rate. Theoretical results demonstrate desirable properties of biomarker combinations produced by the new method. In simulations, the biomarker combination provided by our method demonstrated improved operating characteristics in a variety of scenarios when compared with alternative methods for constructing biomarker combinations.
Submitted 4 October, 2019;
originally announced October 2019.
-
Correcting an estimator of a multivariate monotone function with isotonic regression
Authors:
Ted Westling,
Mark van der Laan,
Marco Carone
Abstract:
In many problems, a sensible estimator of a possibly multivariate monotone function may itself fail to be monotone. We study the correction of such an estimator obtained via projection onto the space of functions monotone over a finite grid in the domain. We demonstrate that this corrected estimator has no worse supremal estimation error than the initial estimator, and that analogously corrected confidence bands contain the true function whenever the initial bands do, at no loss to average or maximal band width. Additionally, we demonstrate that the corrected estimator is uniformly asymptotically equivalent to the initial estimator provided that the initial estimator satisfies a stochastic equicontinuity condition and that the true function is Lipschitz and strictly monotone. We provide simple sufficient conditions for our stochastic equicontinuity condition in the important special case that the initial estimator is uniformly asymptotically linear, and illustrate the use of these results for estimation of a G-computed distribution function. Our stochastic equicontinuity condition is weaker than standard uniform stochastic equicontinuity, which has been required for alternative correction procedures. Crucially, this allows us to apply our results to the bivariate correction of the local linear estimator of a conditional distribution function known to be monotone in its conditioning argument. Our experiments suggest that the projection step can yield significant practical improvements in performance for both the estimator and confidence band.
Submitted 4 September, 2019; v1 submitted 21 October, 2018;
originally announced October 2018.
-
Causal isotonic regression
Authors:
Ted Westling,
Peter Gilbert,
Marco Carone
Abstract:
In observational studies, potential confounders may distort the causal relationship between an exposure and an outcome. However, under some conditions, a causal dose-response curve can be recovered using the G-computation formula. Most classical methods for estimating such curves when the exposure is continuous rely on restrictive parametric assumptions, which carry significant risk of model misspecification. Nonparametric estimation in this context is challenging because in a nonparametric model these curves cannot be estimated at regular rates. Many available nonparametric estimators are sensitive to the selection of certain tuning parameters, and performing valid inference with such estimators can be difficult. In this work, we propose a nonparametric estimator of a causal dose-response curve known to be monotone. We show that our proposed estimation procedure generalizes the classical least-squares isotonic regression estimator of a monotone regression function. Specifically, it does not involve tuning parameters, and is invariant to strictly monotone transformations of the exposure variable. We describe theoretical properties of our proposed estimator, including its irregular limit distribution and the potential for doubly-robust inference. Furthermore, we illustrate its performance via numerical studies, and use it to assess the relationship between BMI and immune response in HIV vaccine trials.
Submitted 16 December, 2019; v1 submitted 8 October, 2018;
originally announced October 2018.
-
A unified study of nonparametric inference for monotone functions
Authors:
Ted Westling,
Marco Carone
Abstract:
The problem of nonparametric inference on a monotone function has been extensively studied in many particular cases. Estimators considered have often been of so-called Grenander type, being representable as the left derivative of the greatest convex minorant or least concave majorant of an estimator of a primitive function. In this paper, we provide general conditions for consistency and pointwise convergence in distribution of a class of generalized Grenander-type estimators of a monotone function. This broad class allows the minorization or majoratization operation to be performed on a data-dependent transformation of the domain, possibly yielding benefits in practice. Additionally, we provide simpler conditions and more concrete distributional theory in the important case that the primitive estimator and data-dependent transformation function are asymptotically linear. We use our general results in the context of various well-studied problems, and show that we readily recover classical results established separately in each case. More importantly, we show that our results allow us to tackle more challenging problems involving parameters for which the use of flexible learning strategies appears necessary. In particular, we study inference on monotone density and hazard functions using informatively right-censored data, extending the classical work on independent censoring, and on a covariate-marginalized conditional mean function, extending the classical work on monotone regression functions. In addition to a theoretical study, we present numerical evidence supporting our large-sample results.
Submitted 29 November, 2018; v1 submitted 5 June, 2018;
originally announced June 2018.
-
On-Demand Virtual Research Environments using Microservices
Authors:
Marco Capuccini,
Anders Larsson,
Matteo Carone,
Jon Ander Novella,
Noureddin Sadawi,
Jianliang Gao,
Salman Toor,
Ola Spjuth
Abstract:
The computational demands for scientific applications are continuously increasing. The emergence of cloud computing has enabled on-demand resource allocation. However, relying solely on infrastructure as a service does not achieve the degree of flexibility required by the scientific community. Here we present a microservice-oriented methodology, where scientific applications run in a distributed orchestration platform as software containers, referred to as on-demand, virtual research environments. The methodology is vendor agnostic and we provide an open source implementation that supports the major cloud providers, offering scalable management of scientific pipelines. We demonstrate applicability and scalability of our methodology in life science applications, but the methodology is general and can be applied to other scientific domains.
Submitted 10 May, 2019; v1 submitted 16 May, 2018;
originally announced May 2018.
-
Sequential Double Robustness in Right-Censored Longitudinal Models
Authors:
Alexander R. Luedtke,
Oleg Sofrygin,
Mark J. van der Laan,
Marco Carone
Abstract:
Consider estimating the G-formula for the counterfactual mean outcome under a given treatment regime in a longitudinal study. Bang and Robins provided an estimator for this quantity that relies on a sequential regression formulation of this parameter. This approach is doubly robust in that it is consistent if either the outcome regressions or the treatment mechanisms are consistently estimated. We define a stronger notion of double robustness, termed sequential double robustness, for estimators of the longitudinal G-formula. The definition emerges naturally from a more general definition of sequential double robustness for the outcome regression estimators. An outcome regression estimator is sequentially doubly robust (SDR) if, at each subsequent time point, either the outcome regression or the treatment mechanism is consistently estimated. This form of robustness is exactly what one would anticipate is attainable by studying the remainder term of a first-order expansion of the G-formula parameter. We show that a particular implementation of an existing procedure is SDR. We also introduce a novel SDR estimator, whose development involves a novel translation of ideas used in targeted minimum loss-based estimation to the infinite-dimensional setting.
Submitted 16 May, 2018; v1 submitted 6 May, 2017;
originally announced May 2017.
-
Toward computerized efficient estimation in infinite-dimensional models
Authors:
Marco Carone,
Alexander R. Luedtke,
Mark J. van der Laan
Abstract:
Despite the risk of misspecification they are tied to, parametric models continue to be used in statistical practice because they are accessible to all. In particular, efficient estimation procedures in parametric models are simple to describe and implement. Unfortunately, the same cannot be said of semiparametric and nonparametric models. While the latter often reflect the level of available scientific knowledge more appropriately, performing efficient inference in these models is generally challenging. The efficient influence function is a key analytic object from which the construction of asymptotically efficient estimators can potentially be streamlined. However, the theoretical derivation of the efficient influence function requires specialized knowledge and is often a difficult task, even for experts. In this paper, we propose and discuss a numerical procedure for approximating the efficient influence function. The approach generalizes the simple nonparametric procedures described recently by Frangakis et al. (2015) and Luedtke et al. (2015) to arbitrary models. We present theoretical results to support our proposal, and also illustrate the method in the context of two examples. The proposed approach is an important step toward automating efficient estimation in general statistical models, thereby rendering the use of realistic models in statistical analyses much more accessible.
Submitted 30 August, 2016;
originally announced August 2016.
-
Second-Order Inference for the Mean of a Variable Missing at Random
Authors:
Iván Díaz,
Marco Carone,
Mark J. van der Laan
Abstract:
We present a second-order estimator of the mean of a variable subject to missingness, under the missing at random assumption. The estimator improves upon existing methods by using an approximate second-order expansion of the parameter functional, in addition to the first-order expansion employed by standard doubly robust methods. This results in weaker assumptions about the convergence rates necessary to establish consistency, local efficiency, and asymptotic linearity. The general estimation strategy is developed under the targeted minimum loss-based estimation (TMLE) framework. We present a simulation comparing the sensitivity of the first and second order estimators to the convergence rate of the initial estimators of the outcome regression and missingness score. In our simulation, the second-order TMLE improved the coverage probability of a confidence interval by up to 85%. In addition, we present a first-order estimator inspired by a second-order expansion of the parameter functional. This estimator only requires one-dimensional smoothing, whereas implementation of the second-order TMLE generally requires kernel smoothing on the covariate space. The first-order estimator proposed is expected to have improved finite sample performance compared to existing first-order estimators. In our simulations, the proposed first-order estimator improved the coverage probability by up to 90%. We provide an illustration of our methods using a publicly available dataset to determine the effect of an anticoagulant on health outcomes of patients undergoing percutaneous coronary intervention. We provide R code implementing the proposed estimator.
Submitted 26 November, 2015;
originally announced November 2015.
-
An Omnibus Nonparametric Test of Equality in Distribution for Unknown Functions
Authors:
Alexander R. Luedtke,
Marco Carone,
Mark J. van der Laan
Abstract:
We present a novel family of nonparametric omnibus tests of the hypothesis that two unknown but estimable functions are equal in distribution when applied to the observed data structure. We developed these tests, which represent a generalization of the maximum mean discrepancy tests described in Gretton et al. [2006], using recent developments from the higher-order pathwise differentiability literature. Despite their complex derivation, the associated test statistics can be expressed rather simply as U-statistics. We study the asymptotic behavior of the proposed tests under the null hypothesis and under both fixed and local alternatives. We provide examples to which our tests can be applied and show that they perform well in a simulation study. As an important special case, our proposed tests can be used to determine whether an unknown function, such as the conditional average treatment effect, is equal to zero almost surely.
Submitted 13 June, 2017; v1 submitted 14 October, 2015;
originally announced October 2015.
-
Large-sample study of the kernel density estimators under multiplicative censoring
Authors:
Masoud Asgharian,
Marco Carone,
Vahid Fakoor
Abstract:
The multiplicative censoring model introduced in Vardi [Biometrika 76 (1989) 751--761] is an incomplete data problem whereby two independent samples from the lifetime distribution $G$, $\mathcal{X}_m=(X_1,...,X_m)$ and $\mathcal{Z}_n=(Z_1,...,Z_n)$, are observed subject to a form of coarsening. Specifically, sample $\mathcal{X}_m$ is fully observed while $\mathcal{Y}_n=(Y_1,...,Y_n)$ is observed instead of $\mathcal{Z}_n$, where $Y_i=U_iZ_i$ and $(U_1,...,U_n)$ is an independent sample from the standard uniform distribution. Vardi [Biometrika 76 (1989) 751--761] showed that this model unifies several important statistical problems, such as the deconvolution of an exponential random variable, estimation under a decreasing density constraint and an estimation problem in renewal processes. In this paper, we establish the large-sample properties of kernel density estimators under the multiplicative censoring model. We first construct a strong approximation for the process $\sqrt{k}(\hat{G}-G)$, where $\hat{G}$ is a solution of the nonparametric score equation based on $(\mathcal{X}_m,\mathcal{Y}_n)$, and $k=m+n$ is the total sample size. Using this strong approximation and a result on the global modulus of continuity, we establish conditions for the strong uniform consistency of kernel density estimators. We also make use of this strong approximation to study the weak convergence and integrated squared error properties of these estimators. We conclude by extending our results to the setting of length-biased sampling.
Submitted 29 May, 2012;
originally announced May 2012.