-
Test-Negative Designs with Multiple Testing Sources
Authors:
Mengxin Yu,
Nicholas P. Jewell
Abstract:
Test-negative designs (TNDs), a form of case-cohort study, are widely used to evaluate infectious disease interventions, notably for influenza and, more recently, COVID-19 vaccines. TNDs rely on recruiting individuals who are tested for the disease of interest and comparing test-positive and test-negative individuals by exposure status (e.g., vaccination). Traditionally, TND studies focused on symptomatic individuals to minimize confounding from healthcare-seeking behavior. However, during outbreaks such as COVID-19 and Ebola, testing also occurred for asymptomatic individuals (e.g., through contact tracing), introducing potential bias when combining symptomatic and asymptomatic cases. Motivated by a trial evaluating an Ebola virus disease (EVD) vaccine, we study a specific version of this "multiple reasons for testing" problem. In this setting, symptomatic individuals were tested under the standard TND approach, while asymptomatic close contacts of test-positive cases were also tested. We propose a simple method to estimate the common vaccine efficacy across these groups and assess whether efficacy differs by recruitment pathway. Although the EVD trial ended early due to the cessation of the outbreak, the proposed methodology remains relevant for future vaccine trials with similar designs.
Submitted 28 April, 2025;
originally announced April 2025.
-
Investigating symptom duration using current status data: a case study of post-acute COVID-19 syndrome
Authors:
Charles J. Wolock,
Susan Jacob,
Julia C. Bennett,
Anna Elias-Warren,
Jessica O'Hanlon,
Avi Kenny,
Nicholas P. Jewell,
Andrea Rotnitzky,
Stephen R. Cole,
Ana A. Weil,
Helen Y. Chu,
Marco Carone
Abstract:
For infectious diseases, characterizing symptom duration is of clinical and public health importance. Symptom duration may be assessed by surveying infected individuals and querying symptom status at the time of survey response. For example, in a SARS-CoV-2 testing program at the University of Washington, participants were surveyed at least $28$ days after testing positive and asked to report current symptom status. This study design yielded current status data: outcome measurements for each respondent consisted only of the time of survey response and a binary indicator of whether symptoms had resolved by that time. Such a study design benefits from limited risk of recall bias, but analyzing the resulting data necessitates tailored statistical tools. Here, we review methods for current status data and describe a novel application of modern nonparametric techniques to this setting. The proposed approach is valid under weaker assumptions than existing methods, allows use of flexible machine learning tools, and handles potential survey nonresponse. From the university study, under an assumption that the survey response time is conditionally independent of symptom resolution time within strata of measured covariates, we estimate that 19% of participants experienced ongoing symptoms 30 days after testing positive, decreasing to 7% at 90 days. We assess the sensitivity of these results to deviations from conditional independence, finding the estimates to be more sensitive to assumption violations at 30 days than at 90 days. Female sex, fatigue during acute infection, and higher viral load were associated with slower symptom resolution.
Submitted 17 March, 2025; v1 submitted 4 July, 2024;
originally announced July 2024.
-
Test-negative designs with various reasons for testing: statistical bias and solution
Authors:
Mengxin Yu,
Tom Hongyi Liu,
Kendrick Qijun Li,
Nicholas Jewell,
Eric Tchetgen Tchetgen,
Dylan Small,
Xu Shi,
Bingkai Wang
Abstract:
Test-negative designs are widely used for post-market evaluation of vaccine effectiveness, particularly in cases when randomized trials are not feasible. Differing from classical test-negative designs where only healthcare-seekers with symptoms are included, recent test-negative designs have involved individuals with various reasons for testing, especially in an outbreak setting. While including these data can increase sample size and hence improve precision, concerns have been raised about whether they introduce bias into the current framework of test-negative designs, thereby demanding a formal statistical examination of this modified design. In this article, using statistical derivations, causal graphs, and numerical demonstrations, we show that the standard odds ratio estimator may be biased if various reasons for testing are not accounted for. To eliminate this bias, we identify three categories of reasons for testing, including symptoms, mandatory screening, and case contact tracing, and characterize associated statistical properties and estimands. Based on our characterization, we show how to consistently estimate each estimand via stratification. Furthermore, we describe when these estimands correspond to the same vaccine effectiveness parameter, and, when appropriate, propose a stratified estimator that can incorporate multiple reasons for testing and improve precision. The performance of our proposed method is demonstrated through simulation studies.
Submitted 26 April, 2025; v1 submitted 6 December, 2023;
originally announced December 2023.
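
As a rough illustration of the stratification idea in the abstract above, the sketch below computes an odds ratio within each reason-for-testing stratum and pools the strata with the textbook Mantel-Haenszel estimator. This is an assumption made for illustration, not the estimator proposed in the paper; the function names and all counts are hypothetical, and vaccine effectiveness is taken as one minus the odds ratio, as is conventional for test-negative designs.

```python
# Illustrative sketch only: Mantel-Haenszel pooling is a standard textbook
# technique swapped in here; it is NOT the paper's proposed estimator.

def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata.

    Each stratum is a tuple (a, b, c, d) of hypothetical counts:
      a = vaccinated, test-positive     b = unvaccinated, test-positive
      c = vaccinated, test-negative     d = unvaccinated, test-negative
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these counts")
    return numerator / denominator


def vaccine_effectiveness(odds_ratio):
    """Conventional test-negative summary: VE = 1 - OR."""
    return 1.0 - odds_ratio


# Hypothetical strata, one per reason for testing.
strata_by_reason = {
    "symptoms": (10, 40, 50, 50),
    "contact tracing": (5, 20, 25, 25),
}
pooled_or = mantel_haenszel_or(list(strata_by_reason.values()))
pooled_ve = vaccine_effectiveness(pooled_or)
```

Pooling in this way is only sensible when the stratum-specific odds ratios target the same vaccine effectiveness parameter, which is the condition the abstract says must be checked before combining reasons for testing.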
-
Randomization Inference for Cluster-Randomized Test-Negative Designs with Application to Dengue Studies: Unbiased estimation, Partial compliance, and Stepped-wedge design
Authors:
Bingkai Wang,
Suzanne M. Dufault,
Dylan S. Small,
Nicholas P. Jewell
Abstract:
In 2019, the World Health Organization identified dengue as one of the top ten global health threats. For the control of dengue, the Applying Wolbachia to Eliminate Dengue (AWED) study group conducted a cluster-randomized trial in Yogyakarta, Indonesia, and used a novel design, called the cluster-randomized test-negative design (CR-TND). This design can yield valid statistical inference with data collected by a passive surveillance system and thus has the advantage of cost-efficiency compared to traditional cluster-randomized trials. We investigate the statistical assumptions and properties of CR-TND under a randomization inference framework, which is known to be robust and efficient for small-sample problems. We find that, when the differential healthcare-seeking behavior comparing intervention and control varies across clusters (in contrast to the setting of Dufault and Jewell, 2020, where the differential healthcare-seeking behavior is constant across clusters), current analysis methods for CR-TND can be biased and have inflated type I error. We propose the log-contrast estimator that can eliminate such bias and improve precision by adjusting for covariates. Furthermore, we extend our methods to handle partial intervention compliance and a stepped-wedge design, both of which appear frequently in cluster-randomized trials. Finally, we demonstrate our results by simulation studies and re-analysis of the AWED study.
Submitted 7 February, 2022;
originally announced February 2022.
-
Estimation of population size based on capture recapture designs and evaluation of the estimation reliability
Authors:
Yue You,
Mark van der Laan,
Philip Collender,
Qu Cheng,
Alan Hubbard,
Nicholas P Jewell,
Zhiyue Tom Hu,
Robin Mejia,
Justin Remais
Abstract:
We propose a modern method to estimate population size based on capture-recapture designs of K samples. The observed data is formulated as a sample of n i.i.d. K-dimensional vectors of binary indicators, where the k-th component of each vector indicates the subject being caught by the k-th sample, such that only subjects with nonzero capture vectors are observed. The target quantity is the unconditional probability of the vector being nonzero across both observed and unobserved subjects. We cover models assuming a single constraint (identification assumption) on the K-dimensional distribution such that the target quantity is identified and the statistical model is unrestricted. We present solutions for linear and non-linear constraints commonly assumed to identify capture-recapture models, including no K-way interaction in linear and log-linear models, independence or conditional independence. We demonstrate that the choice of constraint has a dramatic impact on the value of the estimand, showing that it is crucial that the constraint is known to hold by design. For the commonly assumed constraint of no K-way interaction in a log-linear model, the statistical target parameter is only defined when each of the $2^K - 1$ observable capture patterns is present, and therefore suffers from the curse of dimensionality. We propose a targeted MLE based on an undersmoothed lasso model to smooth across the cells while targeting the fit towards the single-valued target parameter of interest. For each identification assumption, we provide simulated inference and confidence intervals to assess the performance of the estimator under correct and incorrect identifying assumptions. We apply the proposed method, alongside existing estimators, to estimate prevalence of a parasitic infection using multi-source surveillance data from a region in southwestern China, under the four identification assumptions.
Submitted 11 May, 2021;
originally announced May 2021.
-
Doubly robust capture-recapture methods for estimating population size
Authors:
Manjari Das,
Edward H. Kennedy,
Nicholas P. Jewell
Abstract:
Estimation of population size using incomplete lists (also called the capture-recapture problem) has a long history across many biological and social sciences. For example, human rights and other groups often construct partial and overlapping lists of victims of armed conflicts, with the hope of using this information to estimate the total number of victims. Earlier statistical methods for this setup either use potentially restrictive parametric assumptions, or else rely on typically suboptimal plug-in-type nonparametric estimators; however, both approaches can lead to substantial bias, the former via model misspecification and the latter via smoothing. Under an identifying assumption that two lists are conditionally independent given measured covariate information, we make several contributions. First, we derive the nonparametric efficiency bound for estimating the capture probability, which indicates the best possible performance of any estimator, and sheds light on the statistical limits of capture-recapture methods. Then we present a new estimator, and study its finite-sample properties, showing that it has a double robustness property new to capture-recapture, and that it is near-optimal in a non-asymptotic sense, under relatively mild nonparametric conditions. Next, we give a method for constructing confidence intervals for total population size from generic capture probability estimators, and prove non-asymptotic near-validity. Finally, we study our methods in simulations, and apply them to estimate the number of killings and disappearances attributable to different groups in Peru during its internal armed conflict between 1980 and 2000.
Submitted 31 July, 2021; v1 submitted 28 April, 2021;
originally announced April 2021.
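
For readers unfamiliar with the two-list setup described above, the sketch below shows the classical Lincoln-Petersen estimator and Chapman's small-sample correction. Both are standard textbook estimators that assume the two lists are marginally independent; they are not the doubly robust estimator of the paper, which assumes independence only conditional on covariates. The function names and list sizes are hypothetical.

```python
# Illustrative sketch only: classical two-list estimators, NOT the paper's
# doubly robust method. Assumes the two lists are independent with no covariates.

def lincoln_petersen(n1, n2, m):
    """Estimate total population size from two lists.

    n1, n2 = number of individuals on list 1 and list 2
    m      = number of individuals appearing on both lists
    """
    if m <= 0:
        raise ValueError("the estimator is undefined when the lists do not overlap")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's variant, which remains defined when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical list sizes.
estimate_lp = lincoln_petersen(200, 150, 30)
estimate_chapman = chapman(200, 150, 30)
```

The implied capture probability is the number of distinct individuals observed divided by the estimated population size, which is the quantity the abstract's efficiency bound concerns.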
-
Interval censored recursive forests
Authors:
Hunyong Cho,
Nicholas P. Jewell,
Michael R. Kosorok
Abstract:
We propose interval censored recursive forests (ICRF), an iterative tree ensemble method for interval censored survival data. This nonparametric regression estimator makes the best use of censored information by iteratively updating the survival estimate, and can be viewed as a self-consistent estimator with convergence monitored using out-of-bag samples. Splitting rules optimized for interval censored data are developed and kernel smoothing is applied. The ICRF displays the highest prediction accuracy among competing nonparametric methods in most of the simulations and in an applied example to avalanche data. An R package icrf is available for implementation.
Submitted 20 May, 2021; v1 submitted 20 December, 2019;
originally announced December 2019.
-
On a general structure for hazard-based regression models: an application to population-based cancer research
Authors:
Francisco J. Rubio,
Laurent Remontet,
Nicholas P. Jewell,
Aurélien Belot
Abstract:
The proportional hazards model represents the most commonly assumed hazard structure when analysing time to event data using regression models. We study a general hazard structure which contains, as particular cases, proportional hazards, accelerated hazards, and accelerated failure time structures, as well as combinations of these. We propose an approach to apply these different hazard structures, based on a flexible parametric distribution (Exponentiated Weibull) for the baseline hazard. This distribution allows us to cover the basic hazard shapes of interest in practice: constant, bathtub, increasing, decreasing, and unimodal. In an extensive simulation study, we evaluate our approach in the context of excess hazard modelling, which is the main quantity of interest in descriptive cancer epidemiology. This study exhibits good inferential properties of the proposed model, as well as good performance when using the Akaike Information Criterion for selecting the hazard structure. An application on lung cancer data illustrates the usefulness of the proposed model.
Submitted 22 May, 2018; v1 submitted 21 May, 2018;
originally announced May 2018.