-
Debiased machine learning for counterfactual survival functionals based on left-truncated right-censored data
Authors:
Eric R. Morenz,
Charles J. Wolock,
Marco Carone
Abstract:
Learning causal effects of a binary exposure on time-to-event endpoints can be challenging because survival times may be partially observed due to censoring and systematically biased due to truncation. In this work, we present debiased machine learning-based nonparametric estimators of the joint distribution of a counterfactual survival time and baseline covariates for use when the observed data a…
▽ More
Learning causal effects of a binary exposure on time-to-event endpoints can be challenging because survival times may be partially observed due to censoring and systematically biased due to truncation. In this work, we present debiased machine learning-based nonparametric estimators of the joint distribution of a counterfactual survival time and baseline covariates for use when the observed data are subject to covariate-dependent left truncation and right censoring and when baseline covariates suffice to deconfound the relationship between exposure and survival time. Our inferential procedures explicitly allow the integration of flexible machine learning tools for nuisance estimation, and enjoy certain robustness properties. The approach we propose can be directly used to make pointwise or uniform inference on smooth summaries of the joint counterfactual survival time and covariate distribution, and can be valuable even in the absence of interventions, when summaries of a marginal survival distribution are of interest. We showcase how our procedures can be used to learn a variety of inferential targets and illustrate their performance in simulation studies.
△ Less
Submitted 13 November, 2024;
originally announced November 2024.
-
Investigating symptom duration using current status data: a case study of post-acute COVID-19 syndrome
Authors:
Charles J. Wolock,
Susan Jacob,
Julia C. Bennett,
Anna Elias-Warren,
Jessica O'Hanlon,
Avi Kenny,
Nicholas P. Jewell,
Andrea Rotnitzky,
Stephen R. Cole,
Ana A. Weil,
Helen Y. Chu,
Marco Carone
Abstract:
For infectious diseases, characterizing symptom duration is of clinical and public health importance. Symptom duration may be assessed by surveying infected individuals and querying symptom status at the time of survey response. For example, in a SARS-CoV-2 testing program at the University of Washington, participants were surveyed at least $28$ days after testing positive and asked to report curr…
▽ More
For infectious diseases, characterizing symptom duration is of clinical and public health importance. Symptom duration may be assessed by surveying infected individuals and querying symptom status at the time of survey response. For example, in a SARS-CoV-2 testing program at the University of Washington, participants were surveyed at least $28$ days after testing positive and asked to report current symptom status. This study design yielded current status data: outcome measurements for each respondent consisted only of the time of survey response and a binary indicator of whether symptoms had resolved by that time. Such study design benefits from limited risk of recall bias, but analyzing the resulting data necessitates tailored statistical tools. Here, we review methods for current status data and describe a novel application of modern nonparametric techniques to this setting. The proposed approach is valid under weaker assumptions compared to existing methods, allows use of flexible machine learning tools, and handles potential survey nonresponse. From the university study, under an assumption that the survey response time is conditionally independent of symptom resolution time within strata of measured covariates, we estimate that 19% of participants experienced ongoing symptoms 30 days after testing positive, decreasing to 7% at 90 days. We assess the sensitivity of these results to deviations from conditional independence, finding the estimates to be more sensitive to assumption violations at 30 days compared to 90 days. Female sex, fatigue during acute infection, and higher viral load were associated with slower symptom resolution.
△ Less
Submitted 17 March, 2025; v1 submitted 4 July, 2024;
originally announced July 2024.
-
SimEngine: A Modular Framework for Statistical Simulations in R
Authors:
Avi Kenny,
Charles J. Wolock
Abstract:
This article describes SimEngine, an open-source R package for structuring, maintaining, running, and debugging statistical simulations on both local and cluster-based computing environments. Several R packages exist for structuring simulations, but SimEngine is the only package specifically designed for running simulations in parallel via job schedulers on high-performance cluster computing syste…
▽ More
This article describes SimEngine, an open-source R package for structuring, maintaining, running, and debugging statistical simulations on both local and cluster-based computing environments. Several R packages exist for structuring simulations, but SimEngine is the only package specifically designed for running simulations in parallel via job schedulers on high-performance cluster computing systems. The package provides structure and functionality for common simulation tasks, such as setting simulation levels, managing seeds for random number generation, and calculating summary metrics (such as bias and confidence interval coverage). SimEngine also brings several unique features, such as automatic calculation of Monte Carlo error and information-sharing across simulation replicates. We provide an overview of the package and demonstrate some of its advanced functionality.
△ Less
Submitted 8 March, 2024;
originally announced March 2024.
-
Assessing variable importance in survival analysis using machine learning
Authors:
Charles J. Wolock,
Peter B. Gilbert,
Noah Simon,
Marco Carone
Abstract:
Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. For example, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of HIV acquisition over the intended follow-up period, and investigators may wish to understand ho…
▽ More
Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. For example, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of HIV acquisition over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to HIV acquisition are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 vaccine trial to inform enrollment strategies for future HIV vaccine trials.
△ Less
Submitted 12 August, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
A framework for leveraging machine learning tools to estimate personalized survival curves
Authors:
Charles J. Wolock,
Peter B. Gilbert,
Noah Simon,
Marco Carone
Abstract:
The conditional survival function of a time-to-event outcome subject to censoring and truncation is a common target of estimation in survival analysis. This parameter may be of scientific interest and also often appears as a nuisance in nonparametric and semiparametric problems. In addition to classical parametric and semiparametric methods (e.g., based on the Cox proportional hazards model), flex…
▽ More
The conditional survival function of a time-to-event outcome subject to censoring and truncation is a common target of estimation in survival analysis. This parameter may be of scientific interest and also often appears as a nuisance in nonparametric and semiparametric problems. In addition to classical parametric and semiparametric methods (e.g., based on the Cox proportional hazards model), flexible machine learning approaches have been developed to estimate the conditional survival function. However, many of these methods are either implicitly or explicitly targeted toward risk stratification rather than overall survival function estimation. Others apply only to discrete-time settings or require inverse probability of censoring weights, which can be as difficult to estimate as the outcome survival function itself. Here, we employ a decomposition of the conditional survival function in terms of observable regression models in which censoring and truncation play no role. This allows application of an array of flexible regression and classification methods rather than only approaches that explicitly handle the complexities inherent to survival data. We outline estimation procedures based on this decomposition, empirically assess their performance, and demonstrate their use on data from an HIV vaccine trial.
△ Less
Submitted 31 October, 2023; v1 submitted 6 November, 2022;
originally announced November 2022.