Search | arXiv e-print repository

Debiased machine learning for counterfactual survival functionals based on left-truncated right-censored data

Authors: Eric R. Morenz, Charles J. Wolock, Marco Carone

Abstract: Learning causal effects of a binary exposure on time-to-event endpoints can be challenging because survival times may be partially observed due to censoring and systematically biased due to truncation. In this work, we present debiased machine learning-based nonparametric estimators of the joint distribution of a counterfactual survival time and baseline covariates for use when the observed data are subject to covariate-dependent left truncation and right censoring and when baseline covariates suffice to deconfound the relationship between exposure and survival time. Our inferential procedures explicitly allow the integration of flexible machine learning tools for nuisance estimation, and enjoy certain robustness properties. The approach we propose can be directly used to make pointwise or uniform inference on smooth summaries of the joint counterfactual survival time and covariate distribution, and can be valuable even in the absence of interventions, when summaries of a marginal survival distribution are of interest. We showcase how our procedures can be used to learn a variety of inferential targets and illustrate their performance in simulation studies.

Submitted 13 November, 2024; originally announced November 2024.

Comments: The first two authors contributed equally to this work. 61 pages (36 main text, 25 supplement). 6 figures (6 main text, 0 supplement)

arXiv:2407.04214

doi 10.1097/EDE.0000000000001882

Investigating symptom duration using current status data: a case study of post-acute COVID-19 syndrome

Authors: Charles J. Wolock, Susan Jacob, Julia C. Bennett, Anna Elias-Warren, Jessica O'Hanlon, Avi Kenny, Nicholas P. Jewell, Andrea Rotnitzky, Stephen R. Cole, Ana A. Weil, Helen Y. Chu, Marco Carone

Abstract: For infectious diseases, characterizing symptom duration is of clinical and public health importance. Symptom duration may be assessed by surveying infected individuals and querying symptom status at the time of survey response. For example, in a SARS-CoV-2 testing program at the University of Washington, participants were surveyed at least $28$ days after testing positive and asked to report current symptom status. This study design yielded current status data: outcome measurements for each respondent consisted only of the time of survey response and a binary indicator of whether symptoms had resolved by that time. Such study design benefits from limited risk of recall bias, but analyzing the resulting data necessitates tailored statistical tools. Here, we review methods for current status data and describe a novel application of modern nonparametric techniques to this setting. The proposed approach is valid under weaker assumptions compared to existing methods, allows use of flexible machine learning tools, and handles potential survey nonresponse. From the university study, under an assumption that the survey response time is conditionally independent of symptom resolution time within strata of measured covariates, we estimate that 19% of participants experienced ongoing symptoms 30 days after testing positive, decreasing to 7% at 90 days. We assess the sensitivity of these results to deviations from conditional independence, finding the estimates to be more sensitive to assumption violations at 30 days compared to 90 days. Female sex, fatigue during acute infection, and higher viral load were associated with slower symptom resolution.

Submitted 17 March, 2025; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: The first two authors contributed equally to this work. Main text: 22 pages, 2 figures, 4 tables. Supplement: 23 pages, 14 figures, 0 tables. This update (v3) includes sensitivity analysis methodology and results

arXiv:2403.05698

SimEngine: A Modular Framework for Statistical Simulations in R

Authors: Avi Kenny, Charles J. Wolock

Abstract: This article describes SimEngine, an open-source R package for structuring, maintaining, running, and debugging statistical simulations on both local and cluster-based computing environments. Several R packages exist for structuring simulations, but SimEngine is the only package specifically designed for running simulations in parallel via job schedulers on high-performance cluster computing systems. The package provides structure and functionality for common simulation tasks, such as setting simulation levels, managing seeds for random number generation, and calculating summary metrics (such as bias and confidence interval coverage). SimEngine also brings several unique features, such as automatic calculation of Monte Carlo error and information-sharing across simulation replicates. We provide an overview of the package and demonstrate some of its advanced functionality.

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2311.12726

doi 10.1093/biomet/asae061

Assessing variable importance in survival analysis using machine learning

Authors: Charles J. Wolock, Peter B. Gilbert, Noah Simon, Marco Carone

Abstract: Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. For example, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of HIV acquisition over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to HIV acquisition are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 vaccine trial to inform enrollment strategies for future HIV vaccine trials.

Submitted 12 August, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: 98 total pages (37 main text, 61 supplementary)

Journal ref: Biometrika 112(2) (2025)

arXiv:2211.03031

doi 10.1080/10618600.2024.2304070

A framework for leveraging machine learning tools to estimate personalized survival curves

Authors: Charles J. Wolock, Peter B. Gilbert, Noah Simon, Marco Carone

Abstract: The conditional survival function of a time-to-event outcome subject to censoring and truncation is a common target of estimation in survival analysis. This parameter may be of scientific interest and also often appears as a nuisance in nonparametric and semiparametric problems. In addition to classical parametric and semiparametric methods (e.g., based on the Cox proportional hazards model), flexible machine learning approaches have been developed to estimate the conditional survival function. However, many of these methods are either implicitly or explicitly targeted toward risk stratification rather than overall survival function estimation. Others apply only to discrete-time settings or require inverse probability of censoring weights, which can be as difficult to estimate as the outcome survival function itself. Here, we employ a decomposition of the conditional survival function in terms of observable regression models in which censoring and truncation play no role. This allows application of an array of flexible regression and classification methods rather than only approaches that explicitly handle the complexities inherent to survival data. We outline estimation procedures based on this decomposition, empirically assess their performance, and demonstrate their use on data from an HIV vaccine trial.

Submitted 31 October, 2023; v1 submitted 6 November, 2022; originally announced November 2022.

Comments: 52 pages, 13 figures

Journal ref: Journal of Computational and Graphical Statistics 33(3) 1098-1108 (2024)

Showing 1–5 of 5 results for author: Wolock, C J